Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-08 Thread Gregory Farnum
On Wed, Jan 7, 2015 at 9:55 PM, Christian Balzer ch...@gol.com wrote:
 On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote:

 On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote:

  However, I suspect that temporarily setting min_size to a lower number
  could be enough for the PGs to recover.  If "ceph osd pool <pool> set
  min_size 1" doesn't get the PGs going, I suppose restarting at least
  one of the OSDs involved in the recovery, so that the PG undergoes
  peering again, would get you going again.
 

 It depends on how incomplete your incomplete PGs are.

 min_size is defined as "Sets the minimum number of replicas required for
 I/O".  By default, size is 3 and min_size is 2 on recent versions of
 ceph.

 If the number of replicas you have drops below min_size, then Ceph will
 mark the PG as incomplete.  As long as you have one copy of the PG, you
 can recover by lowering the min_size to the number of copies you do
 have, then restoring the original value after recovery is complete.  I
 did this last week when I deleted the wrong PGs as part of a toofull
 experiment.
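
 For reference, the recovery described above boils down to something like the
 following (a minimal sketch; the pool name "rbd" is a placeholder, and you
 should note the current values first so you can restore them afterwards):

   ceph osd pool get rbd size
   ceph osd pool get rbd min_size
   # temporarily allow I/O and peering with a single surviving copy
   ceph osd pool set rbd min_size 1
   # ... wait for recovery to finish (watch ceph -w / ceph health) ...
   ceph osd pool set rbd min_size 2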

 Which of course raises the question of why not have min_size at 1
 permanently, so that in the (hopefully rare) case of losing 2 OSDs at the
 same time your cluster still keeps working (as it should with a size of 3).

You no longer have write durability if you only have one copy of a PG.

Sam is fixing things up so that recovery will work properly as long as
you have a whole copy of the PG, which should make things behave as
people expect.
-Greg


Re: [ceph-users] Documentation of ceph pg num query

2015-01-09 Thread Gregory Farnum
On Fri, Jan 9, 2015 at 1:24 AM, Christian Eichelmann
christian.eichelm...@1und1.de wrote:
 Hi all,

 as mentioned last year, our ceph cluster is still broken and unusable.
 We are still investigating what has happened and I am taking a deeper
 look into the output of "ceph pg <pgnum> query".

 The problem is that I can find some information about what some of the
 sections mean, but mostly I can only guess. Is there any kind of
 documentation where I can find some explanation of what's stated there?
 Because without that the output is barely useful.

There is unfortunately not really documentation around this right now.
If you have specific questions someone can probably help you with
them, though.


Re: [ceph-users] Uniform distribution

2015-01-09 Thread Gregory Farnum
100GB objects (or ~40 on a hard drive!) are way too large for you to
get an effective random distribution.
-Greg

On Thu, Jan 8, 2015 at 5:25 PM, Mark Nelson mark.nel...@inktank.com wrote:
 On 01/08/2015 03:35 PM, Michael J Brewer wrote:

 Hi all,

 I'm working on filling a cluster to near capacity for testing purposes.
 Though I'm noticing that it isn't storing the data uniformly between
 OSDs during the filling process. I currently have the following levels:

 Node 1:
 /dev/sdb1  3904027124  2884673100  1019354024  74%
 /var/lib/ceph/osd/ceph-0
 /dev/sdc1  3904027124  2306909388  1597117736  60%
 /var/lib/ceph/osd/ceph-1
 /dev/sdd1  3904027124  3296767276   607259848  85%
 /var/lib/ceph/osd/ceph-2
 /dev/sde1  3904027124  3670063612   233963512  95%
 /var/lib/ceph/osd/ceph-3

 Node 2:
 /dev/sdb1  3904027124  3250627172   653399952  84%
 /var/lib/ceph/osd/ceph-4
 /dev/sdc1  3904027124  3611337492   292689632  93%
 /var/lib/ceph/osd/ceph-5
 /dev/sdd1  3904027124  2831199600  1072827524  73%
 /var/lib/ceph/osd/ceph-6
 /dev/sde1  3904027124  2466292856  1437734268  64%
 /var/lib/ceph/osd/ceph-7

 I am using rados put to upload 100g files to the cluster, doing two at
 a time from two different locations. Is this expected behavior, or can
 someone shed light on why it is doing this? We're using the open-source
 version 0.80.7. We're also using the default CRUSH configuration.


 So crush utilizes pseudo-random distributions, but sadly random
 distributions tend to be clumpy and not perfectly uniform until you get to
 very high sample counts. The gist of it is that if you have a really low
 density of PGs/OSD and/or are very unlucky, you can end up with a skewed
 distribution.  If you are even more unlucky, you could compound that with a
 streak of objects landing on PGs associated with some specific OSD.  This
 particular case looks rather bad.  How many PGs and OSDs do you have?
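
 As a rough illustration of the density effect (the numbers here are
 hypothetical, not taken from this cluster): with 8 OSDs, a pool of 256 PGs at
 size 2 gives about 256 * 2 / 8 = 64 PGs per OSD, and at that sample size a
 pseudo-random placement can easily leave some OSDs holding noticeably more
 than their share. You can eyeball the actual spread with something like:

   ceph osd tree
   ceph pg dump osds    # per-OSD stats, including space used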


 Regards,
  MICHAEL J. BREWER

  Phone: 1-512-286-5596 | Tie-Line: 363-5596
  E-mail: mjbre...@us.ibm.com


 11501 Burnet Rd
 Austin, TX 78758-3400
 United States






Re: [ceph-users] ceph on peta scale

2015-01-09 Thread Gregory Farnum
On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:
 I just finished configuring ceph up to 100 TB with openstack... Since we
 are also using Lustre on our HPC machines, I am just wondering what the
 bottleneck is for ceph going to peta scale like Lustre.

 Any idea? Has someone tried it?

If you're talking about people building a petabyte Ceph system, there
are *many* who run clusters of that size. If you're talking about the
Ceph filesystem as a replacement for Lustre at that scale, the concern
is less about the raw amount of data and more about the resiliency of
the current code base at that size...but if you want to try it out and
tell us what problems you run into we will love you forever. ;)
(The scalable file system use case is what actually spawned the Ceph
project, so in theory there shouldn't be any serious scaling
bottlenecks. In practice it will depend on what kind of metadata
throughput you need because the multi-MDS stuff is improving but still
less stable.)
-Greg


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Gregory Farnum
On Fri, Jan 9, 2015 at 2:00 AM, Nico Schottelius
nico-ceph-us...@schottelius.org wrote:
 Lionel, Christian,

 we do have the exactly same trouble as Christian,
 namely

 Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
 We still don't know what caused this specific error...

 and

 ...there is currently no way to make ceph forget about the data of this pg
 and create it as an empty one. So the only way
 to make this pool usable again is to lose all your data in there.

 I wonder what is the position of ceph developers regarding
 dropping (emptying) specific pgs?
 Is that a use case that was never thought of or tested?

I've never worked directly on any of the clusters this has happened to,
but I believe every time we've seen issues like this with somebody we
have a relationship with it's either:
1) been resolved by using the existing tools to mark stuff lost, or
2) been the result of local filesystems/disks silently losing data due
to some fault or other.

The second case means the OSDs have corrupted state and trusting them
is tricky. Also, most people we've had relationships with that this
has happened to really want to not lose all the data in the PG, which
necessitates manually mucking around anyway. ;)

Mailing list issues are obviously a lot harder to categorize, but the
ones we've taken time on where people say the commands don't work have
generally fallen into the second bucket.

If you want to experiment, I think all the manual mucking around has
been done with the objectstore tool and removing bad PGs, moving them
around, or faking journal entries, but I've not done it myself so I
could be mistaken.
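
For what it's worth, a hedged sketch of that kind of surgery (the binary has
been called ceph_objectstore_tool or ceph-objectstore-tool depending on
release; the paths and pgid are placeholders, the OSD must be stopped first,
and exporting before removing anything is strongly advised):

  ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-NN \
      --journal-path /var/lib/ceph/osd/ceph-NN/journal \
      --pgid 2.1f --op export --file /root/pg2.1f.export
  ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-NN \
      --journal-path /var/lib/ceph/osd/ceph-NN/journal \
      --pgid 2.1f --op remove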
-Greg


Re: [ceph-users] ceph on peta scale

2015-01-12 Thread Gregory Farnum
On Mon, Jan 12, 2015 at 3:55 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:
 Thanks Greg. No, I am more into a large-scale RADOS system, not the filesystem.

 However, for geographically distributed datacentres, especially when the
 network fluctuates, how do we handle that? From what I read it seems Ceph
 needs a big pipe of network.

Ceph isn't really suited for WAN-style distribution. Some users have
high-enough and consistent-enough bandwidth (with low enough latency)
to do it, but otherwise you probably want to use Ceph within the data
centers and layer something else on top of it.
-Greg


 /Zee

 On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se
 wrote:
  I just finished configuring ceph up to 100 TB with openstack ... Since
  we
  are also using Lustre in our HPC machines , just wondering what is the
  bottle neck in ceph going on Peta Scale like Lustre .
 
  any idea ? or someone tried it

 If you're talking about people building a petabyte Ceph system, there
 are *many* who run clusters of that size. If you're talking about the
 Ceph filesystem as a replacement for Lustre at that scale, the concern
 is less about the raw amount of data and more about the resiliency of
 the current code base at that size...but if you want to try it out and
 tell us what problems you run into we will love you forever. ;)
 (The scalable file system use case is what actually spawned the Ceph
 project, so in theory there shouldn't be any serious scaling
 bottlenecks. In practice it will depend on what kind of metadata
 throughput you need because the multi-MDS stuff is improving but still
 less stable.)
 -Greg




 --

 Regards

 Zeeshan Ali Shah
 System Administrator - PDC HPC
 PhD researcher (IT security)
 Kungliga Tekniska Hogskolan
 +46 8 790 9115
 http://www.pdc.kth.se/members/zashah


Re: [ceph-users] reset osd perf counters

2015-01-12 Thread Gregory Farnum
perf reset on the admin socket. I'm not sure what version it went in
to; you can check the release logs if it doesn't work on whatever you
have installed. :)
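
For example, a minimal sketch (the socket path, and whether the "all" argument
is accepted, may vary by release; it is per-daemon, so run it on each OSD's
host):

  ceph daemon osd.73 perf reset all
  # or, via the socket file directly:
  ceph --admin-daemon /var/run/ceph/ceph-osd.73.asok perf reset all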
-Greg


On Mon, Jan 12, 2015 at 2:26 PM, Shain Miley smi...@npr.org wrote:
 Is there a way to 'reset' the osd perf counters?

 The numbers for osd 73 through osd 83 look really high compared to the rest
 of the numbers I see here.

 I was wondering if I could clear the counters out, so that I have a fresh
 set of data to work with.


 root@cephmount1:/var/log/samba# ceph osd perf
 osdid fs_commit_latency(ms) fs_apply_latency(ms)
     0                     0                   45
     1                     0                   14
     2                     0                   47
     3                     0                   25
     4                     1                   44
     5                     1                    2
     6                     1                    2
     7                     0                   39
     8                     0                   32
     9                     0                   34
    10                     2                  186
    11                     0                   68
    12                     1                    1
    13                     0                   34
    14                     0                    1
    15                     2                   37
    16                     0                   23
    17                     0                   28
    18                     0                   26
    19                     0                   22
    20                     0                    2
    21                     2                   24
    22                     0                   33
    23                     0                    1
    24                     3                   98
    25                     2                   70
    26                     0                    1
    27                     3                   99
    28                     0                    2
    29                     2                  101
    30                     2                   72
    31                     2                   81
    32                     3                  112
    33                     3                   94
    34                     4                  152
    35                     0                   56
    36                     0                    2
    37                     2                   58
    38                     0                    1
    39                     0                    3
    40                     0                    2
    41                     0                    2
    42                     1                    1
    43                     0                    2
    44                     1                   44
    45                     0                    2
    46                     0                    1
    47                     3                   85
    48                     0                    1
    49                     2                   75
    50                     4                  398
    51                     3                  115
    52                     0                    1
    53                     2                   47
    54                     6                  290
    55                     5                  153
    56                     7                  453
    57                     2                   66
    58                     1                    1
    59                     5                  196
    60                     0                    0
    61                     0                   93
    62                     0                    9
    63                     0                    1
    64                     0                    1
    65                     0                    4
    66                     0                    1
    67                     0                   18
    68                     0                   16
    69                     0                   81
    70                     0                   70
    71                     0                    0
    72                     0                    1
    73                    74                 1217
    74                     0                    1
    75                    64                 1238
    76                    92                 1248
    77                     0                    1
    78                     0                    1
    79                   109                 1333
    80                    68                 1451
    81                    66                 1192
    82                    95                 1215
    83                    81                 1331
    84                     3                   56
    85                     3                   65
    86                     0                    1
    87                     3                   55
    88

Re: [ceph-users] cephfs modification time

2015-01-12 Thread Gregory Farnum
Zheng, this looks like a kernel client issue to me, or else something
funny is going on with the cap flushing and the timestamps (note how
the reading client's ctime is set to an even second, while the mtime
is ~.63 seconds later and matches what the writing client sees). Any
ideas?
-Greg

On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote:
 Hi Gregory,


 $ uname -a
 Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64
 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux


 Kernel Client, using  `mount -t ceph ...`


 core@coreos2 /var/run/systemd/system $ modinfo ceph
 filename:   /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko
 license:GPL
 description:Ceph filesystem for Linux
 author: Patience Warnick patie...@newdream.net
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 alias:  fs-ceph
 depends:libceph
 intree: Y
 vermagic:   3.17.7+ SMP mod_unload
 signer: Magrathea: Glacier signing key
 sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
 sig_hashalgo:   sha256

 core@coreos2 /var/run/systemd/system $ modinfo libceph
 filename:   /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko
 license:GPL
 description:Ceph filesystem for Linux
 author: Patience Warnick patie...@newdream.net
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 depends:libcrc32c
 intree: Y
 vermagic:   3.17.7+ SMP mod_unload
 signer: Magrathea: Glacier signing key
 sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
 sig_hashalgo:   sha256



 ceph is installed on a ubuntu containers (same kernel):

 $ dpkg -l |grep ceph

 ii  ceph 0.87-1trusty
 amd64distributed storage and file system
 ii  ceph-common  0.87-1trusty
 amd64common utilities to mount and interact with a ceph
 storage cluster
 ii  ceph-fs-common   0.87-1trusty
 amd64common utilities to mount and interact with a ceph file
 system
 ii  ceph-fuse0.87-1trusty
 amd64FUSE-based client for the Ceph distributed file system
 ii  ceph-mds 0.87-1trusty
 amd64metadata server for the ceph distributed file system
 ii  libcephfs1   0.87-1trusty
 amd64Ceph distributed file system client library
 ii  python-ceph  0.87-1trusty
 amd64Python libraries for the Ceph distributed filesystem



 Reproducing the error:

 at machine 1:
 core@coreos1 /var/lib/deis/store/logs $ > test.log
 core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log
 core@coreos1 /var/lib/deis/store/logs $ stat test.log
   File: 'test.log'
   Size: 2 Blocks: 1  IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511629882  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/core)
 Access: 2015-01-12 20:05:03.0 +
 Modify: 2015-01-12 20:06:09.637234229 +
 Change: 2015-01-12 20:06:09.637234229 +
  Birth: -

 at machine 2:
 core@coreos2 /var/lib/deis/store/logs $ stat test.log
   File: 'test.log'
   Size: 2 Blocks: 1  IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511629882  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/core)
 Access: 2015-01-12 20:05:03.0 +
 Modify: 2015-01-12 20:06:09.637234229 +
 Change: 2015-01-12 20:06:09.0 +
  Birth: -


 Change time is not updated, making some tail libs not show new
 content until you force the change time to be updated, for example by
 running touch on the file.
 Some tools freeze and trigger other issues in the system.


 Tests, all in the machine #2:

 FAILED - https://github.com/ActiveState/tail
 FAILED - /usr/bin/tail of a Google docker image running debian wheezy
 PASSED - /usr/bin/tail of a ubuntu 14.04 docker image
 PASSED - /usr/bin/tail of the coreos release 494.5.0


 Tests in machine #1 (same machine that is writing the file) all tests pass.



 On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote:
 What versions of all the Ceph pieces are you using? (Kernel
 client/ceph-fuse, MDS, etc)

 Can you provide more details on exactly what the program is doing on
 which nodes?
 -Greg

 On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote:
 first 3 stat commands shows blocks and size changing, but not the times
 after a touch it changes and tail works

 I saw some cephfs freezes related to it, it came back after touching the 
 files

 coreos2 logs # stat deis-router.log
   File: 'deis-router.log'
   Size: 148564 Blocks: 291IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511628780  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2015-01-10 01:13:00.100582619

Re: [ceph-users] cephfs modification time

2015-01-14 Thread Gregory Farnum
Awesome, thanks for the bug report and the fix, guys. :)
-Greg

On Mon, Jan 12, 2015 at 11:18 PM, 严正 z...@redhat.com wrote:
 I tracked down the bug. Please try the attached patch

 Regards
 Yan, Zheng




 On 13 Jan 2015, at 07:40, Gregory Farnum g...@gregs42.com wrote:

 Zheng, this looks like a kernel client issue to me, or else something
 funny is going on with the cap flushing and the timestamps (note how
 the reading client's ctime is set to an even second, while the mtime
 is ~.63 seconds later and matches what the writing client sees). Any
 ideas?
 -Greg

 On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote:
 Hi Gregory,


 $ uname -a
 Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64
 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux


 Kernel Client, using  `mount -t ceph ...`


 core@coreos2 /var/run/systemd/system $ modinfo ceph
 filename:   /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko
 license:GPL
 description:Ceph filesystem for Linux
 author: Patience Warnick patie...@newdream.net
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 alias:  fs-ceph
 depends:libceph
 intree: Y
 vermagic:   3.17.7+ SMP mod_unload
 signer: Magrathea: Glacier signing key
 sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
 sig_hashalgo:   sha256

 core@coreos2 /var/run/systemd/system $ modinfo libceph
 filename:   /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko
 license:GPL
 description:Ceph filesystem for Linux
 author: Patience Warnick patie...@newdream.net
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 depends:libcrc32c
 intree: Y
 vermagic:   3.17.7+ SMP mod_unload
 signer: Magrathea: Glacier signing key
 sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
 sig_hashalgo:   sha256



 ceph is installed on a ubuntu containers (same kernel):

 $ dpkg -l |grep ceph

 ii  ceph 0.87-1trusty
 amd64distributed storage and file system
 ii  ceph-common  0.87-1trusty
 amd64common utilities to mount and interact with a ceph
 storage cluster
 ii  ceph-fs-common   0.87-1trusty
 amd64common utilities to mount and interact with a ceph file
 system
 ii  ceph-fuse0.87-1trusty
 amd64FUSE-based client for the Ceph distributed file system
 ii  ceph-mds 0.87-1trusty
 amd64metadata server for the ceph distributed file system
 ii  libcephfs1   0.87-1trusty
 amd64Ceph distributed file system client library
 ii  python-ceph  0.87-1trusty
 amd64Python libraries for the Ceph distributed filesystem



 Reproducing the error:

 at machine 1:
 core@coreos1 /var/lib/deis/store/logs $ > test.log
 core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log
 core@coreos1 /var/lib/deis/store/logs $ stat test.log
  File: 'test.log'
  Size: 2 Blocks: 1  IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511629882  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/core)
 Access: 2015-01-12 20:05:03.0 +
 Modify: 2015-01-12 20:06:09.637234229 +
 Change: 2015-01-12 20:06:09.637234229 +
 Birth: -

 at machine 2:
 core@coreos2 /var/lib/deis/store/logs $ stat test.log
  File: 'test.log'
  Size: 2 Blocks: 1  IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511629882  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/core)
 Access: 2015-01-12 20:05:03.0 +
 Modify: 2015-01-12 20:06:09.637234229 +
 Change: 2015-01-12 20:06:09.0 +
 Birth: -


 Change time is not updated making some tail libs to not show new
 content until you force the change time be updated, like running a
 touch in the file.
 Some tools freeze and trigger other issues in the system.


 Tests, all in the machine #2:

 FAILED - https://github.com/ActiveState/tail
 FAILED - /usr/bin/tail of a Google docker image running debian wheezy
 PASSED - /usr/bin/tail of a ubuntu 14.04 docker image
 PASSED - /usr/bin/tail of the coreos release 494.5.0


 Tests in machine #1 (same machine that is writing the file) all tests pass.



 On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote:
 What versions of all the Ceph pieces are you using? (Kernel
 client/ceph-fuse, MDS, etc)

 Can you provide more details on exactly what the program is doing on
 which nodes?
 -Greg

 On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote:
 first 3 stat commands shows blocks and size changing, but not the times
 after a touch it changes and tail works

 I saw some cephfs freezes related to it, it came back after touching the 
 files

 coreos2 logs # stat deis

Re: [ceph-users] NUMA zone_reclaim_mode

2015-01-14 Thread Gregory Farnum
On Mon, Jan 12, 2015 at 8:25 AM, Dan Van Der Ster
daniel.vanders...@cern.ch wrote:

 On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

 On Mon, 12 Jan 2015, Dan Van Der Ster wrote:

 Moving forward, I think it would be good for Ceph to at least document
 this behaviour, but better would be to also detect when
 zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
 line from the commit which disables it in the kernel is pretty wise,
 IMHO: "On current machines and workloads it is often the case that
 zone_reclaim_mode destroys performance but not all users know how to
 detect this. Favour the common case and disable it by default."
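
 For anyone hitting this before such a warning exists, a manual check and fix
 looks roughly like the following (the sysctl name comes from the kernel docs;
 the persistence file may differ by distro):

   cat /proc/sys/vm/zone_reclaim_mode                    # non-zero = reclaim on
   sysctl -w vm.zone_reclaim_mode=0                      # disable immediately
   echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf   # persist across reboots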


 Sounds good to me.  Do you mind submitting a patch that prints a warning
 from either FileStore::_detect_fs()?  That will appear in the local
 ceph-osd.NNN.log.

  Alternatively, we should send something to the cluster log
  (osd->clog.warning() << ...) but if we go that route we need to be
  careful that the logger is up and running first, which (I think) rules out
 FileStore::_detect_fs().  It could go in OSD itself although that seems
 less clean since the recommendation probably doesn't apply when
 using a backend that doesn't use a file system…


 Sure, I’ll try to prepare a patch which warns but isn’t too annoying.
 MongoDB already solved the heuristic:

 https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp

 It’s licensed as AGPLv3 -- do you already know if we can borrow such code
 into Ceph?

https://www.gnu.org/licenses/license-list.html#AGPL

I've read that and the linked Affero Article 13 and I actually can't
tell if Ceph is safe to integrate or not, but I'm thinking no since
the servers are under LGPL. :/ Also I'm not sure if storage system
users qualify as remote users but I don't think we're going to print
an Affero string every time somebody runs a ceph tool. ;)
-Greg


Re: [ceph-users] [rbd] Ceph RBD kernel client using with cephx

2015-02-09 Thread Gregory Farnum
Unmapping is an operation local to the host and doesn't communicate
with the cluster at all (at least, in the kernel you're running...in
very new code it might involve doing an unwatch, which will require
communication). That means there's no need for a keyring, since its
purpose is to validate communication with the cluster.
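
As an aside, mapping does work without the default keyring location if you
point rbd at the key explicitly; a hedged sketch (user name and path are
placeholders):

  sudo rbd map testcephx --id admin --keyring /tmp/ceph.client.admin.keyring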
-Greg

On Mon, Feb 9, 2015 at 6:58 AM, Vikhyat Umrao vum...@redhat.com wrote:
 Hi,

 While using the rbd kernel client with cephx, the admin user without the
 admin keyring was not able to map the rbd image to a block device, and this
 should be the workflow.

 But the issue is that once I unmap the rbd image without the admin keyring,
 it allows unmapping the image, and as per my understanding that should not be
 the case; it should not allow it and should give an error as it did while
 mapping.

 Is this normal behaviour or am I missing something? Maybe a fix (bug) is
 needed?

 

 [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/
 total 16
 -rw-r--r--. 1 root root  63 Feb  9 22:30 ceph.client.admin.keyring
 -rw-r--r--. 1 root root  71 Feb  9 22:23 ceph.client.dell-per620-1.keyring
 -rw-r--r--. 1 root root 467 Feb  9 22:22 ceph.conf
 -rwxr-xr-x. 1 root root  92 Oct 15 01:03 rbdmap
 [ceph@dell-per620-1 ceph]$


 [ceph@dell-per620-1 ceph]$ sudo mv /etc/ceph/ceph.client.admin.keyring
 /tmp/.
 [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/
 total 12
 -rw-r--r--. 1 root root  71 Feb  9 22:23 ceph.client.dell-per620-1.keyring
 -rw-r--r--. 1 root root 467 Feb  9 22:22 ceph.conf
 -rwxr-xr-x. 1 root root  92 Oct 15 01:03 rbdmap
 [ceph@dell-per620-1 ceph]$

 [ceph@dell-per620-1 ceph]$ sudo rbd map testcephx
 rbd: add failed: (22) Invalid argument

 [ceph@dell-per620-1 ceph]$ sudo dmesg
 [437447.308705] libceph: no secret set (for auth_x protocol)
 [437447.308761] libceph: error -22 on auth protocol 2 init
 [437447.308809] libceph: client4954 fsid
 d57d909f-8adf-46aa-8cc6-3168974df332

 [ceph@dell-per620-1 ceph]$ sudo mv /tmp/ceph.client.admin.keyring /etc/ceph/
 [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/
 total 16
 -rw-r--r--. 1 root root  63 Feb  9 22:30 ceph.client.admin.keyring
 -rw-r--r--. 1 root root  71 Feb  9 22:23 ceph.client.dell-per620-1.keyring
 -rw-r--r--. 1 root root 467 Feb  9 22:22 ceph.conf
 -rwxr-xr-x. 1 root root  92 Oct 15 01:03 rbdmap

 [ceph@dell-per620-1 ceph]$ sudo rbd map testcephx

 [ceph@dell-per620-1 ceph]$ sudo rbd showmapped
 id pool image snap device
 0  rbd  testcephx -/dev/rbd0

 [ceph@dell-per620-1 ceph]$ sudo dmesg
 [437447.308705] libceph: no secret set (for auth_x protocol)
 [437447.308761] libceph: error -22 on auth protocol 2 init
 [437447.308809] libceph: client4954 fsid
 d57d909f-8adf-46aa-8cc6-3168974df332
 [437496.444701] libceph: client4961 fsid
 d57d909f-8adf-46aa-8cc6-3168974df332
 [437496.447833] libceph: mon1 10.65.200.118:6789 session established
 [437496.482913]  rbd0: unknown partition table
 [437496.483037] rbd: rbd0: added with size 0x800
 [ceph@dell-per620-1 ceph]$

 [ceph@dell-per620-1 ceph]$ sudo mv /etc/ceph/ceph.client.admin.keyring
 /tmp/.
 [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/
 total 12
 -rw-r--r--. 1 root root  71 Feb  9 22:23 ceph.client.dell-per620-1.keyring
 -rw-r--r--. 1 root root 467 Feb  9 22:22 ceph.conf
 -rwxr-xr-x. 1 root root  92 Oct 15 01:03 rbdmap

 [ceph@dell-per620-1 ceph]$ sudo rbd unmap /dev/rbd/rbd/testcephx
 --- If we see here, it has allowed unmapping the rbd image without the
 keyring

 [ceph@dell-per620-1 ceph]$ sudo rbd showmapped --- no mapped image

 -

 Regards,
 Vikhyat














Re: [ceph-users] requests are blocked 32 sec woes

2015-02-09 Thread Gregory Farnum
There are a lot of next steps on
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

You probably want to look at the bits about using the admin socket,
and diagnosing slow requests. :)
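
Concretely, that's the sort of thing these commands cover (the osd id is a
placeholder; pick one that ceph health detail complains about):

  ceph health detail                      # which OSDs have blocked requests
  ceph daemon osd.12 dump_ops_in_flight   # what the stuck ops are waiting on
  ceph daemon osd.12 dump_historic_ops    # recently completed slow ops
  ceph osd perf                           # commit/apply latency per OSD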
-Greg

On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco m...@monaco.cx wrote:
 Hello!

 *** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I
 believe some of the members of your thesis committee were students of his =)

 We have a modest cluster at CU Boulder and are frequently plagued by "requests
 are blocked" issues. I'd greatly appreciate any insight or pointers. The issue
 is not specific to any one OSD; I'm pretty sure they've all showed up in ceph
 health detail at this point.

 We have 8 identical nodes:

 - 5 * 1TB Seagate enterprise SAS drives
   - btrfs
 - 1 * Intel 480G S3500 SSD
   - with 5*16G partitions as journals
   - also hosting the OS, unfortunately
 -  64G RAM
 - 2 * Xeon E5-2630 v2
   - So 24 hyperthreads @ 2.60 GHz
 - 10G-ish IPoIB for networking

 So the cluster has 40TB over 40 OSDs total with a very straightforward 
 crushmap.
 These nodes are also (unfortunately for the time being) OpenStack compute 
 nodes
 and 99% of the usage is OpenStack volumes/images. I see a lot of kernel 
 messages
 like:

 ib_mthca :02:00.0: Async event 16 for bogus QP 00dc0408

 which may or may not be correlated w/ the Ceph hangs.

 Other info: we have 3 mons on 3 of the 8 nodes listed above. The openstack
 volumes pool has 4096 pgs and is sized 3. This is probably too many PGs, but
 came from an initial misunderstanding of the formula in the documentation.

 Thanks,
 Matt


 PS - I'm trying to secure funds to get an additional 8 nodes with a little 
 less
 RAM and CPU to move the OSDs to, with dual 10G Ethernet, and a SATA DOM for 
 the
 OS so the SSD will be strictly journal. I may even be able to get an 
 additional
 SSD or two per-node to use for caching or simply to set a higher primary 
 affinity




Re: [ceph-users] Compilation problem

2015-02-09 Thread Gregory Farnum
On Fri, Feb 6, 2015 at 3:37 PM, David J. Arias david.ar...@getecsa.co wrote:
 Hello!

 I am sysadmin for a small IT consulting enterprise in México.

 We are trying to integrate three servers running RHEL 5.9 into a new
 CEPH cluster.

 I downloaded the source code and tried compiling it, though I got stuck
 with the requirements for leveldb and libblkid.

 The versions installed by the OS are behind the ones recommended so I am
 wondering if it is possible to compile updated ones from source, install
 them in another location (/usr/local/{}) and use those for Ceph.

 Upgrading the OS is (although not impossible) difficult since these are
 production servers which hold critical applications, and some of those
 are legacy ones :-(

 I tried googling around but had no luck as to how to accomplish
 this; ./configure --help doesn't show any way, and I tried --system-root
 without success.

 I am following the instructions from:

 https://wiki.ceph.com/FAQs/What_Kind_of_OS_Does_Ceph_Require%3F
 http://docs.ceph.com/docs/master/install/install-storage-cluster/#installing-a-build
 http://docs.ceph.com/docs/master/install/#get-software
 http://wiki.ceph.com/FAQs

 The only data I've found so far although related doesn't really apply to
 my case:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041683.html
 http://article.gmane.org/gmane.comp.file-systems.ceph.user/3010/match=redhat+5.9

 Any help/ideas/pointers would be great.

I think there's ongoing work to backport (portions of?) Ceph to RHEL5,
but it definitely doesn't build out of the box. Even beyond the
library dependencies you've noticed you'll find more issues with e.g.
the boost and gcc versions. :/
-Greg


Re: [ceph-users] ceph Performance vs PG counts

2015-02-08 Thread Gregory Farnum
On Sun, Feb 8, 2015 at 6:00 PM, Sumit Gaur sumitkg...@gmail.com wrote:
 Hi
 I have installed a 6-node ceph cluster and am doing a performance benchmark
 for it using Nova VMs. What I have observed is that FIO random write reports
 around 250 MBps for 1M block size with 4096 PGs, and 650 MBps for 1M block
 size with a PG count of 2048. Can somebody let me know if I am missing any
 ceph architecture point here? As per my understanding PG numbers are mainly
 involved in calculating the hash and should not affect performance so much.

PGs are also serialization points within the codebase, so depending on
how you're testing you can run into contention if you have multiple
objects within a single PG that you're trying to write to at once.
This isn't normally a problem, but for a single benchmark run the
random collisions can become noticeable.
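
If you want to see the collisions directly, you can map a few of the
benchmark's object names to PGs and check how often the same PG (and primary
OSD) comes up; for example (pool and object names here are placeholders):

  ceph osd map volumes rbd_data.1234.0000000000000001
  ceph osd map volumes rbd_data.1234.0000000000000002
  # output ends in something like: -> pg 5.3a (5.3a) -> up [7,2,11] acting [7,2,11]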
-Greg


Re: [ceph-users] requests are blocked 32 sec woes

2015-02-09 Thread Gregory Farnum
On Mon, Feb 9, 2015 at 7:12 PM, Matthew Monaco m...@monaco.cx wrote:
 On 02/09/2015 08:20 AM, Gregory Farnum wrote:
 There are a lot of next steps on
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

 You probably want to look at the bits about using the admin socket, and
 diagnosing slow requests. :) -Greg

 Yeah, I've been through most of that. It's still been difficult to pinpoint
 what's causing the blocking. Can I get some clarification on this comment:

  "Ceph acknowledges writes after journaling, so fast SSDs are an attractive
  option to accelerate the response time–particularly when using the ext4 or
  XFS filesystems. By contrast, the btrfs filesystem can write and journal
  simultaneously."

 Does this mean btrfs doesn't need separate journal partition/block device? 
 I.e.,
 is what ceph-disk does when creating with --fs-type btrfs entirely non-optimal
 (creates a 5G journal partition and the rest a btrfs partition).

 I just don't get the "by contrast". If the OSD is btrfs+rotational, then why
 doesn't putting the journal on an SSD help (as much?) if writes are returned
 after journaling?

Yeah, that's not quite the best phrasing. btrfs' parallel journaling
can be a big advantage in all-spinner cases where under the right
kinds of load the filesystem actually has a chance of committing data
to disk faster than the journal does. There aren't many situations
where that's likely, though — it's more useful for direct librados
users who might want to proceed once data is readable rather than when
it's durable. That's not an option with xfs.
-Greg



 On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco m...@monaco.cx wrote:
 Hello!

 *** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I
 believe some of the members of your thesis committee were students of his
 =)

 We have a modest cluster at CU Boulder and are frequently plagued by
 "requests are blocked" issues. I'd greatly appreciate any insight or
 pointers. The issue is not specific to any one OSD; I'm pretty sure
 they've all showed up in ceph health detail at this point.

 We have 8 identical nodes:

 - 5 * 1TB Seagate enterprise SAS drives - btrfs - 1 * Intel 480G S3500 SSD
  - with 5*16G partitions as journals - also hosting the OS, unfortunately
 - 64G RAM - 2 * Xeon E5-2630 v2 - So 24 hyperthreads @ 2.60 GHz - 10G-ish
 IPoIB for networking

 So the cluster has 40TB over 40 OSDs total with a very straightforward
 crushmap. These nodes are also (unfortunately for the time being)
 OpenStack compute nodes and 99% of the usage is OpenStack volumes/images. I
 see a lot of kernel messages like:

 ib_mthca :02:00.0: Async event 16 for bogus QP 00dc0408

 which may or may not be correlated w/ the Ceph hangs.

 Other info: we have 3 mons on 3 of the 8 nodes listed above. The openstack
  volumes pool has 4096 pgs and is sized 3. This is probably too many PGs,
 but came from an initial misunderstanding of the formula in the
 documentation.

 Thanks, Matt


 PS - I'm trying to secure funds to get an additional 8 nodes with a little
 less RAM and CPU to move the OSDs to, with dual 10G Ethernet, and a SATA
 DOM for the OS so the SSD will be strictly journal. I may even be able to
 get an additional SSD or two per-node to use for caching or simply to set
 a higher primary affinity




Re: [ceph-users] CRUSHMAP for chassis balance

2015-02-13 Thread Gregory Farnum
With sufficiently new CRUSH versions (all the latest point releases on
LTS?) I think you can simply have the rule return extra IDs which are
dropped if they exceed the number required. So you can choose two chassis,
then have each of those choose two OSDs, and return those 4 from the rule.
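
A sketch of what such a rule might look like (ruleset number and bucket names
are placeholders; test it with crushtool before injecting it into a live
cluster):

  rule replicated_across_chassis {
          ruleset 1
          type replicated
          min_size 2
          max_size 4
          step take default
          step choose firstn 2 type chassis
          step chooseleaf firstn 2 type host
          step emit
  }

  crushtool -i crushmap.compiled --test --rule 1 --num-rep 3 --show-mappings

With a pool size of 3 the rule hands back 4 candidate OSDs (2 per chassis) and
the last one is simply dropped, which is the "extra IDs" behaviour described
above.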
-Greg
On Fri, Feb 13, 2015 at 6:13 AM Luke Kao luke@mycom-osi.com wrote:

  Dear cepher,

  Currently I am working on a crushmap to try to make sure that at least one
 copy goes to a different chassis.

 Say chassis1 has host1,host2,host3, and chassis2 has host4,host5,host6.



 With replication =2, it’s not a problem, I can use the following step in
 rule

 step take chasses1

 step chooseleaf firstn 1 type host

 step emit

 step take chasses2

 step chooseleaf firstn 1 type host

 step emit



 But for replication=3, I tried

 step take chasses1

 step chooseleaf firstn 1 type host

 step emit

 step take chasses2

 step chooseleaf firstn 1 type host

 step emit

 step take default

 step chooseleaf firstn 1 type host

 step emit



  At the end, the 3rd osd returned in the rule test is always a duplicate of
 the first or the second.



 Any idea or what’s the direction to move forward?

 Thanks in advance



 BR,

 Luke

 MYCOM-OSI



 --

 This electronic message contains information from Mycom which may be
 privileged or confidential. The information is intended to be for the use
 of the individual(s) or entity named above. If you are not the intended
 recipient, be aware that any disclosure, copying, distribution or any other
 use of the contents of this information is prohibited. If you have received
 this electronic message in error, please notify us by post or telephone (to
 the numbers or correspondence address above) or by email (at the email
 address above) immediately.


Re: [ceph-users] Random OSDs respawning continuously

2015-02-13 Thread Gregory Farnum
It's not entirely clear, but it looks like all the ops are just your
caching pool OSDs trying to promote objects, and your backing pool OSDs
aren't fast enough to satisfy all the IO demanded of them. You may be
overloading the system.
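
If you want to confirm that, the usual places to look are roughly (osd id and
cache pool name are placeholders):

  ceph daemon osd.551 dump_ops_in_flight   # on the cache-tier OSD the slow ops reference
  ceph osd dump | grep <cachepool>         # shows target_max_bytes and the dirty/full ratios
  ceph -s                                  # watch blocked requests while writing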
-Greg
On Fri, Feb 13, 2015 at 6:06 AM Mohamed Pakkeer mdfakk...@gmail.com wrote:

 Hi all,

   When I stop the respawning osd on an OSD node, another osd starts respawning
  on the same node. When the OSD starts respawning, it puts the
 following info in the osd log.

 slow request 31.129671 seconds old, received at 2015-02-13
 19:09:32.180496: osd_op(*osd.551*.95229:11 191 10005c4.0033
 [copy-get max 8388608] 13.f4ccd256 RETRY=50
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg

 OSD.551 is part of the cache tier. All the respawning osds have log entries
 referencing different cache tier OSDs. If I restart all the osds in the cache
 tier osd node, the respawning stops and the cluster becomes active+clean. But
 when I try to write some data to the cluster, a random osd starts
 respawning again.

 can anyone help me how to solve this issue?


   2015-02-13 19:10:02.309848 7f53eef54700  0 log_channel(default) log
 [WRN] : 11 slow requests, 11 included below; oldest blocked for  30.132629
 secs
 2015-02-13 19:10:02.309854 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 30.132629 seconds old, received at 2015-02-13
 19:09:32.177075: osd_op(osd.551.95229:63
  10002ae. [copy-from ver 7622] 13.7273b256 RETRY=130 snapc
 1=[] ondisk+retry+write+ignore_overlay+enforce_snapc+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:02.309858 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 30.131608 seconds old, received at 2015-02-13
 19:09:32.178096: osd_op(osd.551.95229:41
 5 10003a0.0006 [copy-get max 8388608] 13.aefb256 RETRY=118
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:02.309861 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 30.130994 seconds old, received at 2015-02-13
 19:09:32.178710: osd_op(osd.551.95229:26
 83 100029d.003b [copy-get max 8388608] 13.a2be1256 RETRY=115
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:02.309864 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 30.130426 seconds old, received at 2015-02-13
 19:09:32.179278: osd_op(osd.551.95229:39
 39 10004e9.0032 [copy-get max 8388608] 13.6a25b256 RETRY=105
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:02.309868 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 30.129697 seconds old, received at 2015-02-13
 19:09:32.180007: osd_op(osd.551.95229:97
 49 1000553.007e [copy-get max 8388608] 13.c8645256 RETRY=59
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:03.310284 7f53eef54700  0 log_channel(default) log [WRN]
 : 11 slow requests, 6 included below; oldest blocked for  31.133092 secs
 2015-02-13 19:10:03.310305 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 31.129671 seconds old, received at 2015-02-13
 19:09:32.180496: osd_op(osd.551.95229:11
 191 10005c4.0033 [copy-get max 8388608] 13.f4ccd256 RETRY=50
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:03.310308 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 31.128616 seconds old, received at 2015-02-13
 19:09:32.181551: osd_op(osd.551.95229:12
 903 10002e4.00d6 [copy-get max 8388608] 13.f56a3256 RETRY=41
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:03.310322 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 31.127807 seconds old, received at 2015-02-13
 19:09:32.182360: osd_op(osd.551.95229:14
 165 1000480.0110 [copy-get max 8388608] 13.fd8c1256 RETRY=32
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:03.310327 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 31.127320 seconds old, received at 2015-02-13
 19:09:32.182847: osd_op(osd.551.95229:15
 013 100047f.0133 [copy-get max 8388608] 13.b7b05256 RETRY=27
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e95518) currently reached_pg
 2015-02-13 19:10:03.310331 7f53eef54700  0 log_channel(default) log [WRN]
 : slow request 31.126935 seconds old, received at 2015-02-13
 19:09:32.183232: osd_op(osd.551.95229:15
 767 100066d.001e [copy-get max 8388608] 13.3b017256 RETRY=25
 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 

Re: [ceph-users] kernel crash after 'ceph: mds0 caps stale' and 'mds0 hung' -- issue with timestamps or HVM virtualization on EC2?

2015-02-09 Thread Gregory Farnum
On Mon, Feb 9, 2015 at 11:58 AM, Christopher Armstrong
ch...@opdemand.com wrote:
 Hi folks,

 One of our users is seeing machine crashes almost daily. He's using Ceph
 v0.87 giant, and is seeing this crash:
 https://gist.githubusercontent.com/ianblenke/b74e5aa5547130ebc0fb/raw/c3eeab076310d149443fd6118113b9d94f176303/gistfile1.txt

 It seems easy to trigger this by rsyncing to the CephFS mount. We're using
 the kernel client here, so I'm wondering if it's related to this timestamp
 bug:
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-January/045838.html

These are definitely not related.

 Does anyone have any insight into the crash? Some confirmation that it's
 related to system clocks/timestamps would be helpful.

 Another note is that we're using HVM virtualization on EC2. Not sure if
 people have run into this before or not.

Zheng might have some idea about these, but I'm guessing there's a
code issue and some deadlock with file capabilities.

If you can look at the MDS' admin socket and dump the ops in flight
and the session info that might be helpful too. (ceph daemon mds.a
dump_ops_in_flight, etc)
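
For example, roughly (mds.a is a placeholder for whatever your daemon is
called):

  ceph daemon mds.a dump_ops_in_flight
  ceph daemon mds.a session ls    # per-client session and caps info
  ceph daemon mds.a perf dump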
-Greg


Re: [ceph-users] CephFS removal.

2015-02-12 Thread Gregory Farnum
What version of Ceph are you running? It's varied a bit.

But I think you want to just turn off the MDS and run the fail
command — deactivate is actually the command for removing a logical
MDS from the cluster, and you can't do that for a lone MDS because
there's nobody to pass off the data to. I'll make a ticket to clarify
this. When you've done that you should be able to delete it.
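
In other words, something along these lines (the filesystem removal command
varies by release, and the pool deletions are obviously destructive, so treat
this as a sketch):

  # stop the ceph-mds daemon(s), then mark the rank failed
  ceph mds fail 0
  # remove the filesystem with whatever your release provides,
  # e.g. on newer versions: ceph fs rm data --yes-i-really-mean-it
  # then delete and recreate the underlying pools
  ceph osd pool delete data data --yes-i-really-really-mean-it
  ceph osd pool delete metadata metadata --yes-i-really-really-mean-it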
-Greg

On Mon, Feb 2, 2015 at 1:40 AM,  warren.je...@stfc.ac.uk wrote:
 Hi All,



 Having a few problems removing cephfs file systems.



 I want to remove my current pools (was used for test data) – wiping all
 current data, and start a fresh file system on my current cluster.



 I have looked over the documentation but I can’t find anything on this. I
 have an object store pool, Which I don’t want to remove – but I’d like to
 remove the cephfs file system pools and remake them.





 My cephfs is called ‘data’.



 Running ceph fs delete data returns: Error EINVAL: all MDS daemons must be
 inactive before removing filesystem



 To make an MDS inactive I believe the command is: ceph mds deactivate 0



 Which returns: telling mds.0 135.248.53.134:6809/16692 to deactivate



 Checking the status of the mds using: ceph mds stat  returns: e105: 1/1/0 up
 {0=node2=up:stopping}



 This has been sitting at this status for the whole weekend with no change. I
 don’t have any clients connected currently.



 When trying to manually just remove the pools, it’s not allowed as there is
 a cephfs file system on them.



 I’m happy that all of the failsafe’s to stop someone removing a pool are all
 working correctly.



 If this is currently undoable. Is there a way to quickly wipe a cephfs
 filesystem – using RM from a kernel client is really slow.



 Many thanks



 Warren Jeffs




Re: [ceph-users] CephFS removal.

2015-02-12 Thread Gregory Farnum
Oh, hah, your initial email had a very delayed message
delivery...probably got stuck in the moderation queue. :)

On Thu, Feb 12, 2015 at 8:26 AM,  warren.je...@stfc.ac.uk wrote:
 I am running 0.87, In the end I just wiped the cluster and started again - it 
 was quicker.

 Warren

 -Original Message-
 From: Gregory Farnum [mailto:g...@gregs42.com]
 Sent: 12 February 2015 16:25
 To: Jeffs, Warren (STFC,RAL,ISIS)
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] CephFS removal.

 What version of Ceph are you running? It's varied by a bit.

 But I think you want to just turn off the MDS and run the fail
 command — deactivate is actually the command for removing a logical MDS from 
 the cluster, and you can't do that for a lone MDS because there's nobody to 
 pass off the data to. I'll make a ticket to clarify this. When you've done 
 that you should be able to delete it.
 -Greg

 On Mon, Feb 2, 2015 at 1:40 AM,  warren.je...@stfc.ac.uk wrote:
 Hi All,



 Having a few problems removing cephfs file systems.



 I want to remove my current pools (was used for test data) – wiping
 all current data, and start a fresh file system on my current cluster.



 I have looked over the documentation but I can’t find anything on
 this. I have an object store pool, Which I don’t want to remove – but
 I’d like to remove the cephfs file system pools and remake them.





 My cephfs is called ‘data’.



 Running ceph fs delete data returns: Error EINVAL: all MDS daemons
 must be inactive before removing filesystem



 To make an MDS inactive I believe the command is: ceph mds deactivate
 0



 Which returns: telling mds.0 135.248.53.134:6809/16692 to deactivate



 Checking the status of the mds using: ceph mds stat  returns: e105:
 1/1/0 up {0=node2=up:stopping}



 This has been sitting at this status for the whole weekend with no
 change. I don’t have any clients connected currently.



 When trying to manually just remove the pools, it’s not allowed as
 there is a cephfs file system on them.



 I’m happy that all of the failsafe’s to stop someone removing a pool
 are all working correctly.



 If this is currently undoable. Is there a way to quickly wipe a cephfs
 filesystem – using RM from a kernel client is really slow.



 Many thanks



 Warren Jeffs




Re: [ceph-users] OSDs with btrfs are down

2015-01-06 Thread Gregory Farnum
I'm afraid I don't know what would happen if you change those options.
Hopefully we've set it up so things continue to work, but we definitely
don't test it.
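
For reference, those options live in the [osd] section of ceph.conf; a minimal
sketch of what turning them off would look like (whether flipping them on an
existing filestore is safe is exactly the open question here, so try it on a
throwaway OSD first):

  [osd]
      filestore btrfs snap = false
      filestore btrfs clone range = false
      filestore journal parallel = false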
-Greg
On Tue, Jan 6, 2015 at 8:22 AM Lionel Bouton lionel+c...@bouton.name
wrote:

 On 01/06/15 02:36, Gregory Farnum wrote:
  [...]
  filestore btrfs snap controls whether to use btrfs snapshots to keep
  the journal and backing store in check. WIth that option disabled it
  handles things in basically the same way we do with xfs.
 
  filestore btrfs clone range I believe controls how we do RADOS
  object clones. With this option enabled we use the btrfs clone range
  ioctl (? I think that's the interface); without it we do our own
  copies, again basically the same as we do with xfs.

  Thanks for this information. I think I have a clearer picture now; the
  next time I have the opportunity, I'll test BTRFS-based OSDs using manual
  defragmentation (which I suspect might help performance) and if I still
  get stability or performance problems I'll try disabling BTRFS-specific
  features.

 My impression is that the core of BTRFS is stable and performant enough
 for Ceph and that lzo compression and checksums are reasons enough to
 use it instead of XFS but to get stable and performant OSDs some
 features might have to be disabled. Hopefully we will expand our storage
 network in the near future and I'll have the opportunity to test my
 theories with very limited impact on stability and performance.

 Quick follow-up question: can the options filestore btrfs snap,
 filestore btrfs clone range and filestore journal parallel be
 modified on an existing/used OSD? I don't see why not for the last 2 as
 COW being used or not doesn't change other filesystem semantics, but for
 snapshots I'm not sure: at startup the available snapshots could have to
 match a precise OSD filestore state which they wouldn't do after (for
 example) disabling them and enabling them again.

 Best regards,

 Lionel



Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Gregory Farnum
On Wed, Feb 18, 2015 at 3:30 PM, Florian Haas flor...@hastexo.com wrote:
 On Wed, Feb 18, 2015 at 11:41 PM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Feb 18, 2015 at 1:58 PM, Florian Haas flor...@hastexo.com wrote:
 On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
 Dear Ceph Experts,

 is it possible to define a Ceph user/key with privileges
 that allow for read-only CephFS access but do not allow
 write or other modifications to the Ceph cluster?

 Warning, read this to the end, don't blindly do as I say. :)

 All you should need to do is define a CephX identity that has only r
 capabilities on the data pool (assuming you're using a default
 configuration where your CephFS uses the data and metadata pools):

 sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r
 pool=data' mon 'allow r'

 That identity should then be able to mount the filesystem but not
 write any data (use ceph-fuse -n client.readonly or mount -t ceph
 -o name=readonly)

 That said, just touching files or creating them is only a metadata
 operation that doesn't change anything in the data pool, so I think
 that might still be allowed under these circumstances.

 ...and deletes, unfortunately. :(

 If the file being deleted is empty, yes. If the file has any content,
 then the removal should hit the data pool before it hits metadata, and
 should fail there. No?

No, all data deletion is handled by the MDS, for two reasons:
1) You don't want clients to have to block on deletes in time linear
with the number of objects
2) (IMPORTANT) if clients unlink a file which is still opened
elsewhere, it can't be deleted until closed. ;)


I don't think this is presently a
 thing it's possible to do until we get a much better user auth
 capabilities system into CephFS.


 However, I've just tried the above with ceph-fuse on firefly, and I
 was able to mount the filesystem that way and then echo something into
 a previously existing file. After unmounting, remounting, and trying
 to cat that file, I/O just hangs. It eventually does complete, but
 this looks really fishy.

 This is happening because the CephFS clients don't (can't, really, for
 all the time we've spent thinking about it) check whether they have
 read permissions on the underlying pool when buffering writes for a
 file. I believe if you ran an fsync on the file you'd get an EROFS or
 similar.
 Anyway, the client happily buffers up the writes. Depending on how
 exactly you remount then it might not be able to drop the MDS caps for
 file access (due to having dirty data it can't get rid of), and those
 caps have to time out before anybody else can access the file again.
 So you've found an unpleasant oddity of how the POSIX interfaces map
 onto this kind of distributed system, but nothing unexpected. :)

 Oliver's point is valid though; it would be nice if you could somehow
 make CephFS read-only to some (or all) clients server-side, the way an
 NFS ro export does.

Yeah. Yet another thing that would be good but requires real
permission bits on the MDS. It'll happen eventually, but we have other
bits that seem a lot more important...fsck, stability, single-tenant
usability


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-18 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray chrismurra...@gmail.com wrote:
 Hi again Greg :-)

 No, it doesn't seem to progress past that point. I started the OSD again a 
 couple of nights ago:

 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 13288339 
 open_ops 1 - 0, max_applied_seq 13288338 - 13288339
 2015-03-16 21:34:46.221445 7fe4a8aa7780  3 journal journal_replay: r = 0, 
 op_seq now 13288339
 2015-03-16 21:34:46.221513 7fe4a8aa7780  2 journal read_entry 3951706112 : 
 seq 13288340 1755 bytes
 2015-03-16 21:34:46.221547 7fe4a8aa7780  3 journal journal_replay: applying 
 op seq 13288340
 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 13288340 
 open_ops 0 - 1
 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 
 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 
 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore   header.spos 0.0.0
 2015-03-16 21:34:48.096002 7fe49a5ac700 20 
 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start 
 max_applied_seq 13288339, open_ops 1
 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 
 open ops to drain

 There's the success line for 13288339, like you mentioned. But not one for 
 13288340.

 Intriguing. So, those same 1755 bytes seem problematic every time the journal 
 is replayed? Interestingly, there is a lot (in time, not exactly data mass or 
 IOPs, but still more than 1755 bytes!) of activity while the log is at this 
 line:

 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 
 open ops to drain

 ... but then the IO ceases and the log still doesn't go any further. I wonder 
 why 13288339 doesn't have that same  'waiting for ... open ops to drain' 
 line. Or the 'woke after' one for that matter.

 While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:

 sdb  20.00 0.00  3404.00  0   3404
 sdb  16.00 0.00  2100.00  0   2100
 sdb  10.00 0.00  1148.00  0   1148
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   1.00 0.00   496.00  0496
 sdb  32.00 0.00  4940.00  0   4940
 sdb   8.00 0.00  1144.00  0   1144
 sdb   1.00 0.00 4.00  0  4
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb  17.00 0.00  3340.00  0   3340
 sdb  23.00 0.00  3368.00  0   3368
 sdb   1.00 0.00 4.00  0  4
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb  13.00 0.00  3332.00  0   3332
 sdb  18.00 0.00  2360.00  0   2360
 sdb  59.00 0.00  7464.00  0   7464
 sdb   0.00 0.00 0.00  0  0

 I was hoping Google may have held some clues, but it seems I'm the only one 
 :-)

 https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting+for%22+%22open+ops+to+drain%22

 I tried removing compress-force=lzo from osd mount options btrfs in 
 ceph.conf, in case it was the 

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-18 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
 Hi Greg,

 Thanks for your input and completely agree that we cannot expect developers
 to fully document what impact each setting has on a cluster, particularly in
 a performance related way

 That said, if you or others could spare some time for a few pointers it
 would be much appreciated and I will endeavour to create some useful
 results/documents that are more relevant to end users.

 I have taken on board what you said about the WB throttle and have been
 experimenting with it by switching it on and off. I know it's a bit of a
 blunt configuration change, but it was useful to understand its effect. With
 it off, I do see initially quite a large performance increase but over time
 it actually starts to slow the average throughput down. Like you said, I am
 guessing this is to do with it making sure the journal doesn't get too far
 ahead, leaving it with massive syncs to carry out.

 One thing I do see with the WBT enabled and to some extent with it disabled,
 is that there are large periods of small block writes at the max speed of
 the underlying sata disk (70-80iops). Here are 2 blktrace seekwatcher traces
 of performing an OSD bench (64kb io's for 500MB) where this behaviour can be
 seen.

If you're doing 64k IOs then I believe it's creating a new on-disk
file for each of those writes. How that's laid out on-disk will depend
on your filesystem and the specific config options that we're using to
try to avoid running too far ahead of the journal.

I think you're just using these config options in conflict with each
other. You've set the min sync time to 20 seconds for some reason,
presumably to try and batch stuff up? So in that case you probably
want to let your journal run for twenty seconds worth of backing disk
IO before you start throttling it, and probably 10-20 seconds worth of
IO before forcing file flushes. That means increasing the throttle
limits while still leaving the flusher enabled.
-Greg
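
For reference, a rough ceph.conf sketch of that kind of change. The
option names are the filestore write-back throttle knobs; the values
are illustrative assumptions only, not recommendations, and the xfs
variants assume an XFS backing filesystem (there are btrfs
equivalents):

  [osd]
  filestore wbthrottle xfs bytes start flusher = 419430400
  filestore wbthrottle xfs bytes hard limit = 4194304000
  filestore wbthrottle xfs ios start flusher = 5000
  filestore wbthrottle xfs ios hard limit = 50000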


 http://www.sys-pro.co.uk/misc/wbt_on.png

 http://www.sys-pro.co.uk/misc/wbt_off.png

 I would really appreciate if someone could comment on why this type of
 behaviour happens? As can be seen in the trace, if the blocks are submitted
 to the disk as larger IO's and with higher concurrency, hundreds of Mb of
 data can be flushed in seconds. Is this something specific to the filesystem
 behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes
 which can't be merged into larger IO's?

 For sequential writes, I would have thought that in an optimum scenario, a
 spinning disk should be able to almost maintain its large block write speed
 (100MB/s) no matter the underlying block size. That being said, from what I
 understand when a sync is called it will try and flush all dirty data so the
 end result is probably slightly different to a traditional battery backed
 write back cache.

 Chris, would you be interested in forming a ceph-users based performance
 team? There's a developer performance meeting which is mainly concerned with
 improving the internals of Ceph. There is also a raft of information on the
 mailing list archives where people have said hey look at my SSD speed at
 x,y,z settings, but making comparisons or recommendations is not that easy.
 It may also reduce a lot of the repetitive posts of why is X so
 slowetc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd laggy algorithm

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov asavi...@asdco.ru wrote:
 hello.
 By default, ceph marks an osd node down after receiving 3 reports that the
 node is unreachable. Reports are sent every osd heartbeat grace seconds,
 but with the settings mon_osd_adjust_heartbeat_grace = true and
 mon_osd_adjust_down_out_interval = true the timeout for marking nodes down
 may vary. Please tell me: which algorithm changes the timeout for marking
 nodes down/out, and which parameters affect it?
 thanks.

The monitors keep track of which detected failures are incorrect
(based on reports from the marked-down/out OSDs) and build up an
expectation about how often the failures are correct based on an
exponential backoff of the data points. You can look at the code in
OSDMonitor.cc if you're interested, but basically they apply that
expectation to modify the down interval and the down-out interval to a
value large enough that they believe the OSD is really down (assuming
these config options are set). It's not terribly interesting. :)
-Greg
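
For reference, these are the knobs in question, with what I believe are
their stock defaults, so you can see what the monitor is allowed to
adjust (treat the exact numbers as assumptions to verify against your
version):

  [mon]
  mon osd adjust heartbeat grace = true
  mon osd adjust down out interval = true
  mon osd min down reports = 3

  [osd]
  osd heartbeat grace = 20
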
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

 I’m not sure if it’s something I’m doing wrong or just experiencing an 
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the 
 writes seem to hit the OSD’s straight away instead of coalescing in the 
 journals, is this correct?

 For example if I create a RBD on a standard 3 way replica pool and run fio 
 via librbd 128k writes, I see the journals take all the io’s until I hit my 
 filestore_min_sync_interval and then I see it start writing to the underlying 
 disks.

 Doing the same on a full cache tier (to force flushing)  I immediately see 
 the base disks at a very high utilisation. The journals also have some write 
 IO at the same time. The only other odd thing I can see via iostat is that 
 most of the time whilst I’m running Fio, is that I can see the underlying 
 disks doing very small write IO’s of around 16kb with an occasional big burst 
 of activity.

 I know erasure coding+cache tier is slower than just plain replicated pools, 
 but even with various high queue depths I’m struggling to get much above 
 100-150 iops compared to a 3 way replica pool which can easily achieve 
 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked 
 difference and I’m wondering if this strange journal behaviour is the cause.

 Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching
an object which isn't in the cache pool it will try and evict an
object. That's probably what you're seeing.

Cache pool in general are only a wise idea if you have a very skewed
distribution of data hotness and the entire hot zone can fit in
cache at once.
-Greg
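
As an aside, the knobs that keep a cache tier from running completely
full are the pool tiering targets; a minimal sketch (the pool name
hot-pool and the values are placeholders, not recommendations):

  ceph osd pool set hot-pool target_max_bytes 1000000000000
  ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
  ceph osd pool set hot-pool cache_target_full_ratio 0.8
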
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 3:49 PM, Francois Lafont flafdiv...@free.fr wrote:
 Hi,

 I was always in the same situation: I couldn't remove an OSD without
 having some PGs definitely stuck in the active+remapped state.

 But I remembered I read on IRC that, before to mark out an OSD, it
 could be sometimes a good idea to reweight it to 0. So, instead of
 doing [1]:

 ceph osd out 3

 I have tried [2]:

 ceph osd crush reweight osd.3 0 # waiting for the rebalancing...
 ceph osd out 3

 and it worked. Then I could remove my osd with the online documentation:
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

 Now, the osd is removed and my cluster is HEALTH_OK. \o/

 Now, my question is: why my cluster was definitely stuck to active+remapped
 with [1] but was not with [2]? Personally, I have absolutely no explanation.
 If you have an explanation, I'd love to know it.

If I remember/guess correctly, if you mark an OSD out it won't
necessarily change the weight of the bucket above it (ie, the host),
whereas if you change the weight of the OSD then the host bucket's
weight changes. That makes for different mappings, and since you only
have a couple of OSDs per host (normally: hurray!) and not many hosts
(normally: sadness) then marking one OSD out makes things harder for
the CRUSH algorithm.
-Greg


 Should the reweight command be present in the online documentation?
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
 If yes, I can make a pull request on the doc with pleasure. ;)

 Regards.

 --
 François Lafont
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Gregory Farnum
The information you're giving sounds a little contradictory, but my
guess is that you're seeing the impacts of object promotion and
flushing. You can sample the operations the OSDs are doing at any
given time by running ops_in_progress (or similar, I forget exact
phrasing) command on the OSD admin socket. I'm not sure if rados df
is going to report cache movement activity or not.

That though would mostly be written to the SSDs, not the hard drives —
although the hard drives could still get metadata updates written when
objects are flushed. What data exactly are you seeing that's leading
you to believe writes are happening against these drives? What is the
exact CephFS and cache pool configuration?
-Greg
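
(For what it's worth, I believe the admin socket command being referred
to is dump_ops_in_flight, i.e. something along the lines of

  ceph daemon osd.0 dump_ops_in_flight

run on the host that carries the OSD in question.)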

On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 I forgot to mention: while I am seeing these writes in iotop and
 /proc/diskstats for the hdd's, I am -not- seeing any writes in rados
 df for the pool residing on these disks. There is only one pool active
 on the hdd's and according to rados df it is getting zero writes when
 I'm just reading big files from cephfs.

 So apparently the osd's are doing some non-trivial amount of writing on
 their own behalf. What could it be?

 Thanks,

 Erik.


 On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,

 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.

 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.

 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.

 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.

 Thanks,

 Erik.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
Nothing here particularly surprises me. I don't remember all the
details of the filestore's rate limiting off the top of my head, but
it goes to great lengths to try and avoid letting the journal get too
far ahead of the backing store. Disabling the filestore flusher and
increasing the sync intervals without also increasing the
filestore_wbthrottle_* limits is not going to work well for you.
-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?

 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing)  I immediately see 
  the
 base disks at a very high utilisation. The journals also have some write IO 
 at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I’m running Fio, is that I can see the underlying disks doing
 very small write IO’s of around 16kb with an occasional big burst of 
 activity.
 
  I know erasure coding+cache tier is slower than just plain replicated 
  pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?

 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.

 Cache pool in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

 Hi Greg,

 It's not the caching behaviour that I'm confused about, it’s the journal 
 behaviour on the base disks during flushing. I've been doing some more tests 
 and can do something reproducible which seems strange to me.

 First off 10MB of 4kb writes:
 time ceph tell osd.1 bench 1000 4096
 { bytes_written: 1000,
   blocksize: 4096,
   bytes_per_sec: 16009426.00}

 real0m0.760s
 user0m0.063s
 sys 0m0.022s

 Now split this into 2x5mb writes:
 time ceph tell osd.1 bench 500 4096   time ceph tell osd.1 bench 
 500 4096
 { bytes_written: 500,
   blocksize: 4096,
   bytes_per_sec: 10580846.00}

 real0m0.595s
 user0m0.065s
 sys 0m0.018s
 { bytes_written: 500,
   blocksize: 4096,
   bytes_per_sec: 9944252.00}

 real0m4.412s
 user0m0.053s
 sys 0m0.071s

 2nd bench takes a lot longer even though both should easily fit in the 5GB 
 journal. Looking at iostat, I think I can see that no writes happen to the 
 journal whilst the writes from the 1st bench are being flushed. Is this the 
 expected behaviour? I would have thought as long as there is space available 
 in the journal it shouldn't block on new writes. Also I see in iostat writes 
 to the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
 seconds, with a large blip or activity just before the flush finishes. Is 
 this the correct behaviour? I would have thought if this tell osd bench is 
 doing sequential IO then the journal should be able to flush 5-10mb of data 
 in a fraction a second.

 Ceph.conf
 [osd]
 filestore max sync interval = 30
 filestore min sync interval = 20
 filestore flusher = false
 osd_journal_size = 5120
 osd_crush_location_hook = /usr/local/bin/crush-location
 osd_op_threads = 5
 filestore_op_threads = 4


 iostat during period where writes seem to be blocked (journal=sda disk=sdd)

 Device:  rrqm/s  wrqm/s    r/s    w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sda        0.00    0.00   0.00   0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdb        0.00    0.00   0.00   2.00     0.00     4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdc        0.00    0.00   0.00   0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdd        0.00    0.00   0.00  76.00     0.00   760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

 iostat during

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-20 Thread Gregory Farnum
On Fri, Mar 20, 2015 at 4:03 PM, Chris Murray chrismurra...@gmail.com wrote:
 Ah, I was wondering myself if compression could be causing an issue, but I'm 
 reconsidering now. My latest experiment should hopefully help troubleshoot.

 So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I 
 try that:

 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
 btrfs filesystem defragment -v -czlib -- {} +

 After much, much waiting, all files have been rewritten, but the OSD still 
 gets stuck at the same point.

 I've now unset the compress attribute on all files and started the defragment 
 process again, but I'm not too hopeful since the files must be 
 readable/writeable if I didn't get some failure during the defrag process.

 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
 chattr -c -- {} +
 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
 btrfs filesystem defragment -v -- {} +

 (latter command still running)

 Any other ideas at all? In the absence of the problem being spelled out to me 
 with an error of some sort, I'm not sure how to troubleshoot further.

Not much, sorry.

 Is it safe to upgrade a problematic cluster, when the time comes, in case 
 this ultimately is a CEPH bug which is fixed in something later than 0.80.9?

In general it should be fine since we're careful about backwards
compatibility, but without knowing the actual issue I can't promise
anything.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Readonly cache tiering and rbd.

2015-03-19 Thread Gregory Farnum
On Thu, Mar 19, 2015 at 4:46 AM, Matthijs Möhlmann
matth...@cacholong.nl wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Hi,

 - From the documentation:

 Cache Tier readonly:

 Read-only Mode: When admins configure tiers with readonly mode, Ceph
 clients write data to the backing tier. On read, Ceph copies the
 requested object(s) from the backing tier to the cache tier. Stale
 objects get removed from the cache tier based on the defined policy.
 This approach is ideal for immutable data (e.g., presenting
 pictures/videos on a social network, DNA data, X-Ray imaging, etc.),
 because reading data from a cache pool that might contain out-of-date
 data provides weak consistency. Do not use readonly mode for mutable data.

 Does this mean that when a client (xen / kvm with a RBD volume) writes
 some data that the OSD does not mark the readonly cache dirty?

Yes, exactly. Reads are directed to the cache but writes go directly
to the base tier, and there's no attempt at communication about the
changed objects.
-Greg
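
For context, a minimal sketch of how a read-only tier gets wired up, as
I remember the cache tiering docs (pool names are placeholders):

  ceph osd tier add cold-pool hot-pool
  ceph osd tier cache-mode hot-pool readonly
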
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer ch...@gol.com wrote:

 Hello,

 On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

 On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Greg,
 
  Thanks for your input and completely agree that we cannot expect
  developers to fully document what impact each setting has on a
  cluster, particularly in a performance related way
 
  That said, if you or others could spare some time for a few pointers it
  would be much appreciated and I will endeavour to create some useful
  results/documents that are more relevant to end users.
 
  I have taken on board what you said about the WB throttle and have been
  experimenting with it by switching it on and off. I know it's a bit of
  a blunt configuration change, but it was useful to understand its
  effect. With it off, I do see initially quite a large performance
  increase but over time it actually starts to slow the average
  throughput down. Like you said, I am guessing this is to do with it
  making sure the journal doesn't get too far ahead, leaving it with
  massive syncs to carry out.
 
  One thing I do see with the WBT enabled and to some extent with it
  disabled, is that there are large periods of small block writes at the
  max speed of the underlying sata disk (70-80iops). Here are 2 blktrace
  seekwatcher traces of performing an OSD bench (64kb io's for 500MB)
  where this behaviour can be seen.

 If you're doing 64k IOs then I believe it's creating a new on-disk
 file for each of those writes. How that's laid out on-disk will depend
 on your filesystem and the specific config options that we're using to
 try to avoid running too far ahead of the journal.

 Could you elaborate on that a bit?
 I would have expected those 64KB writes to go to the same object (file)
 until it is full (4MB).

 Because this behavior would explain some (if not all) of the write
 amplification I've seen in the past with small writes (see the SSD
 Hardware recommendation thread).

Ah, no, you're right. With the bench command it all goes in to one
object, it's just a separate transaction for each 64k write. But again
depending on flusher and throttler settings in the OSD, and the
backing FS' configuration, it can be a lot of individual updates — in
particular, every time there's a sync it has to update the inode.
Certainly that'll be the case in the described configuration, with
relatively low writeahead limits on the journal but high sync
intervals — once you hit the limits, every write will get an immediate
flush request.

But none of that should have much impact on your write amplification
tests unless you're actually using osd bench to test it. You're more
likely to be seeing the overhead of the pg log entry, pg info change,
etc that's associated with each write.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-19 Thread Gregory Farnum
On Thu, Mar 19, 2015 at 2:41 PM, Nick Fisk n...@fisk.me.uk wrote:
 I'm looking at trialling OSD's with a small flashcache device over them to
 hopefully reduce the impact of metadata updates when doing small block io.
 Inspiration from here:-

 http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

 One thing I suspect will happen, is that when the OSD node starts up udev
 could possibly mount the base OSD partition instead of flashcached device,
 as the base disk will have the ceph partition uuid type. This could result
 in quite nasty corruption.

 I have had a look at the Ceph udev rules and can see that something similar
 has been done for encrypted OSD's. Am I correct in assuming that what I need
 to do is to create a new partition uuid type for flashcached OSD's and then
 create a udev rule to activate these new uuid'd OSD's once flashcache has
 finished assembling them?

I haven't worked with the udev rules in a while, but that sounds like
the right way to go.
-Greg
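
A very rough sketch of the shape such a rule might take, purely as an
illustration (the GUID and the script path are invented; the shipped
rules in 95-ceph-osd.rules are the thing to copy from):

  # hypothetical "flashcache-backed OSD" partition type
  ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="partition", ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-ffffffffffff", RUN+="/usr/local/sbin/flashcache-activate-osd /dev/$name"
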
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds log message

2015-03-20 Thread Gregory Farnum
On Fri, Mar 20, 2015 at 12:39 PM, Daniel Takatori Ohara
dtoh...@mochsl.org.br wrote:
 Hello,

 Anybody help me, please? Appear any messages in log of my mds.

 And after the shell of my clients freeze.

 2015-03-20 12:23:54.068005 7f1608d49700  0 log_channel(default) log [WRN] :
 client.3197487 isn't responding to mclientcaps(revoke), ino 11b1696
 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 962.02

Well, this one means that it asked a client to revoke some file
capabilities 962 seconds ago, and the client still hasn't.

 2015-03-20 12:23:54.068135 7f1608d49700  0 log_channel(default) log [WRN] :
 1 slow requests, 1 included below; oldest blocked for  962.028297 secs
 2015-03-20 12:23:54.068142 7f1608d49700  0 log_channel(default) log [WRN] :
 slow request 962.028297 seconds old, received at 2015-03-20 12:07:52.039805:
 client_request(client.3197487:391527 create #11b

And this is a request from the same client to create a file, also
received ~962 seconds ago. This is probably blocked by the
aforementioned capability drop.

Everything that follows these have a good chance of being follow-on
effects. The issue will probably clear itself up if you just restart
the MDS.

We've fixed a lot of bugs around this recently (although it's an
ongoing source of them), so unless you're running very new code I
would just restart and not worry about it.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS

2015-03-20 Thread Gregory Farnum
On Fri, Mar 20, 2015 at 1:05 PM, Ridwan Rashid ridwan...@gmail.com wrote:
 Gregory Farnum greg@... writes:


 On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan064@... wrote:
  Hi,
 
  I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop with
  cephFS. I have installed hadoop-1.1.1 in the nodes and changed the
  conf/core-site.xml file according to the ceph documentation
  http://ceph.com/docs/master/cephfs/hadoop/ but after changing the file the
  namenode is not starting (namenode can be formatted) but the other
  services(datanode, jobtracker, tasktracker) are running in hadoop.
 
  The default hadoop works fine but when I change the core-site.xml file as
  above I get the following bindException as can be seen from the namenode
 log:
 
 
  2015-03-19 01:37:31,436 ERROR
  org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException:
  Problem binding to node1/10.242.144.225:6789 : Cannot assign requested
 address
 
 
  I have one monitor for the ceph cluster (node1/10.242.144.225) and I
  included in the core-site.xml file ceph://10.242.144.225:6789 as the value
  of fs.default.name. The 6789 port is the default port being used by the
  monitor node of ceph, so that may be the reason for the bindException but
  the ceph documentation mentions that it should be included like this in the
  core-site.xml file. It would be really helpful to get some pointers to 
  where
  I am doing wrong in the setup.

 I'm a bit confused. The NameNode is only used by HDFS, and so
 shouldn't be running at all if you're using CephFS. Nor do I have any
 idea why you've changed anything in a way that tells the NameNode to
 bind to the monitor's IP address; none of the instructions that I see
 can do that, and they certainly shouldn't be.
 -Greg


 Hi Greg,

 I want to run a hadoop job (e.g. terasort) and want to use cephFS instead of
 HDFS. In Using Hadoop with cephFS documentation in
 http://ceph.com/docs/master/cephfs/hadoop/ if you look into the Hadoop
 configuration section, the first property fs.default.name has to be set as
 the ceph URI and in the notes it's mentioned as ceph://[monaddr:port]/. My
 core-site.xml of hadoop conf looks like this

 <configuration>

 <property>
 <name>fs.default.name</name>
 <value>ceph://10.242.144.225:6789</value>
 </property>

Yeah, that all makes sense. But I don't understand why or how you're
starting up a NameNode at all, nor what config values it's drawing
from to try and bind to that port. The NameNode is the problem because
it shouldn't even be invoked.
-Greg
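
(A guess, in case it helps: if you're launching everything with
bin/start-all.sh, that also starts the HDFS daemons. With CephFS as the
filesystem you'd only start the MapReduce side, roughly:

  bin/stop-all.sh
  bin/start-mapred.sh   # jobtracker + tasktrackers, no NameNode/DataNodes

assuming the usual hadoop-1.x control scripts.)
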
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS

2015-03-20 Thread Gregory Farnum
On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan...@gmail.com wrote:
 Hi,

 I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop with
 cephFS. I have installed hadoop-1.1.1 in the nodes and changed the
 conf/core-site.xml file according to the ceph documentation
 http://ceph.com/docs/master/cephfs/hadoop/ but after changing the file the
 namenode is not starting (namenode can be formatted) but the other
 services(datanode, jobtracker, tasktracker) are running in hadoop.

 The default hadoop works fine but when I change the core-site.xml file as
 above I get the following bindException as can be seen from the namenode log:


 2015-03-19 01:37:31,436 ERROR
 org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException:
 Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address


 I have one monitor for the ceph cluster (node1/10.242.144.225) and I
 included in the core-site.xml file ceph://10.242.144.225:6789 as the value
 of fs.default.name. The 6789 port is the default port being used by the
 monitor node of ceph, so that may be the reason for the bindException but
 the ceph documentation mentions that it should be included like this in the
 core-site.xml file. It would be really helpful to get some pointers to where
 I am doing wrong in the setup.

I'm a bit confused. The NameNode is only used by HDFS, and so
shouldn't be running at all if you're using CephFS. Nor do I have any
idea why you've changed anything in a way that tells the NameNode to
bind to the monitor's IP address; none of the instructions that I see
can do that, and they certainly shouldn't be.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 11:14 AM, Georgios Dimitrakakis
gior...@acmac.uoc.gr wrote:
 Hi all!

 I have recently updated to CEPH version 0.80.9 (latest Firefly release)
 which presumably
 supports direct upload.

 I 've tried to upload a file using this functionality and it seems that is
 working
 for files up to 5GB. For files above 5GB there is an error. I believe that
 this is because
 of a hardcoded limit:

 #define RGW_MAX_PUT_SIZE (5ULL*1024*1024*1024)


 Is there a way to increase that limit other than compiling CEPH from source?

No.


 Could we somehow put it as a configuration parameter?

Maybe, but I'm not sure if Yehuda would want to take it upstream or
not. This limit is present because it's part of the S3 spec. For
larger objects you should use multi-part upload, which can get much
bigger.
-Greg
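
Most S3 clients will do the multipart split for you; a hedged example
with s3cmd (bucket and file names are placeholders, and the option is
from the s3cmd versions I've used):

  s3cmd put --multipart-chunk-size-mb=100 big-image.raw s3://mybucket/big-image.raw
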
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 12:12 PM, Craig Lewis cle...@centraldesktop.com wrote:
 Out of curiousity, what's the frequency of the peaks and troughs?

 RadosGW has configs on how long it should wait after deleting before garbage
 collecting, how long between GC runs, and how many objects it can GC in per
 run.

 The defaults are 2 hours, 1 hour, and 32 respectively.  Search
 http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.

 If your peaks and troughs have a frequency less than 1 hour, then GC is
 going to delay and alias the disk usage w.r.t. the object count.

 If you have millions of objects, you probably need to tweak those values.
 If RGW is only GCing 32 objects an hour, it's never going to catch up.


 Now that I think about it, I bet I'm having issues here too.  I delete more
 than (32*24) objects per day...

Uh, that's not quite what rgw_gc_max_objs means. That param configures
how the garbage collection control objects and internal classes are sharded,
and each grouping will only delete one object at a time. So it
controls the parallelism, but not the total number of objects!

Also, Yehuda says that changing this can be a bit dangerous because it
currently needs to be consistent across any program doing or
generating GC work.
-Greg
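
For reference, the GC settings mentioned earlier in the thread, with
what I believe are their defaults (the wait/period values are seconds):

  rgw gc max objs = 32
  rgw gc obj min wait = 7200
  rgw gc processor period = 3600
  rgw gc processor max time = 3600

Per the caveat above, rgw gc max objs in particular shouldn't be
changed casually, since it has to be consistent across everything
generating or processing GC work.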




 On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:

 It is either a problem with CEPH, Civetweb or something else in our
 configuration.
 But deletes in user buckets are still leaving a high number of old shadow
 files. Since we have millions and millions of objects, it is hard to
 reconcile what should and shouldn't exist.

 Looking at our cluster usage, there are no troughs, it is just a rising
 peak.
 But when looking at users data usage, we can see peaks and troughs as you
 would expect as data is deleted and added.

 Our ceph version 0.80.9

 Please ideas?

 On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

 - Original Message -

 From: Ben b@benjackson.email
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?


 It depends. Shadow files are badly named objects that represent part
  of the objects' data. They are only safe to remove if you know that the
 corresponding objects no longer exist.

 Yehuda


 On 2015-03-11 10:03, Ben wrote:
  We have a large number of shadow files in our cluster that aren't
  being deleted automatically as data is deleted.
 
  Is it safe to delete these files?
  Is there something we need to be aware of when deleting them?
  Is there a script that we can run that will delete these safely?
 
  Is there something wrong with our cluster that it isn't deleting these
  files when it should be?
 
  We are using civetweb with radosgw, with tengine ssl proxy infront of
  it
 
  Any advice please
  Thanks
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Gregory Farnum
This might be related to the backtrace assert, but that's the problem
you need to focus on. In particular, both of these errors are caused
by the scrub code, which Sage suggested temporarily disabling — if
you're still getting these messages, you clearly haven't done so
successfully.

That said, it looks like the problem is that the object and/or object
info specified here are just totally busted. You probably want to
figure out what happened there since these errors are normally a
misconfiguration somewhere (e.g., setting nobarrier on fs mount and
then losing power). I'm not sure if there's a good way to repair the
object, but if you can lose the data I'd grab the ceph-objectstore
tool and remove the object from each OSD holding it that way. (There's
a walkthrough of using it for a similar situation in a recent Ceph
blog post.)
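
Roughly, that removal looks something like the following on each OSD
holding a copy; treat it as a sketch of the idea rather than a recipe
(paths and the object name are placeholders, the OSD must be stopped
first, and depending on the release the binary is ceph-objectstore-tool
or ceph_objectstore_tool):

  service ceph stop osd.22
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 \
      --journal-path /var/lib/ceph/osd/ceph-22/journal \
      '<object-name>' remove

The blog post mentioned above walks through finding the right object
with the tool's --op list mode.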

On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 Alright, tried a few suggestions for repairing this state, but I don't seem
 to have any PG replicas that have good copies of the missing / zero length
 shards. What do I do now? telling the pg's to repair doesn't seem to help
 anything? I can deal with data loss if I can figure out which images might
 be damaged, I just need to get the cluster consistent enough that the things
 which aren't damaged can be usable.

 Also, I'm seeing these similar, but not quite identical, error messages as
 well. I assume they are referring to the same root problem:

 -1 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
 soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known
 size 4194304

Mmm, unfortunately that's a different object than the one referenced
in the earlier crash. Maybe it's repairable, or it might be the same
issue — looks like maybe you've got some widespread data loss.
-Greg




 On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:

 Finally found an error that seems to provide some direction:

 -1 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
 e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
 not match object info size (4120576) ajusted for ondisk to (4120576)

 I'm diving into google now and hoping for something useful. If anyone has
 a suggestion, I'm all ears!

 QH

 On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:

 Thanks for the suggestion, but that doesn't seem to have made a
 difference.

 I've shut the entire cluster down and brought it back up, and my config
 management system seems to have upgraded ceph to 0.80.8 during the reboot.
 Everything seems to have come back up, but I am still seeing the crash
 loops, so that seems to indicate that this is definitely something
 persistent, probably tied to the OSD data, rather than some weird transient
 state.


 On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote:

 It looks like you may be able to work around the issue for the moment
 with

  ceph osd set nodeep-scrub

 as it looks like it is scrub that is getting stuck?

 sage


 On Fri, 6 Mar 2015, Quentin Hartman wrote:

  Ceph health detail - http://pastebin.com/5URX9SsQpg dump summary (with
  active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
  an osd crash log (in github gist because it was too big for pastebin)
  -
  https://gist.github.com/qhartman/cb0e290df373d284cfb5
 
  And now I've got four OSDs that are looping.
 
  On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
So I'm in the middle of trying to triage a problem with my ceph
cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
The cluster has been running happily for about a year. This last
weekend, something caused the box running the MDS to sieze hard,
and when we came in on monday, several OSDs were down or
unresponsive. I brought the MDS and the OSDs back on online, and
managed to get things running again with minimal data loss. Had
to mark a few objects as lost, but things were apparently
running fine at the end of the day on Monday.
  This afternoon, I noticed that one of the OSDs was apparently stuck in
  a crash/restart loop, and the cluster was unhappy. Performance was in
  the tank and ceph status is reporting all manner of problems, as one
  would expect if an OSD is misbehaving. I marked the offending OSD out,
  and the cluster started rebalancing as expected. However, I noticed a
  short while later, another OSD has started into a crash/restart loop.
  So, I repeat the process. And it happens again. At this point I
  notice, that there are actually two at a time which are in this state.
 
  It's as if there's some toxic chunk of data that is getting passed
  around, and when it lands on an OSD it kills it. Contrary to that,
  however, I tried just stopping an OSD when it's in a bad state, and
  once the cluster starts 

Re: [ceph-users] flock() supported on CephFS through Fuse ?

2015-03-10 Thread Gregory Farnum
On Tue, Mar 10, 2015 at 4:20 AM, Florent B flor...@coppint.com wrote:
 Hi all,

 I'm testing flock() locking system on CephFS (Giant) using Fuse.

 It seems that lock works per client, and not over all clients.

 Am I right or is it supposed to work over different clients ? Does MDS
 has such a locking system and is it supported through Fuse ?

 Thank you.

 P.S.: I use a simple PHP script to test it, attached.

flock and fcntl locking has been supported in the kernel client for
many years, but was only implemented for ceph-fuse recently. It will
be in hammer and was backported for the next firefly point release,
but is unlikely to go into giant (unless of course somebody from the
community does the backport and enough testing ;).
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with free Inodes

2015-03-24 Thread Gregory Farnum
On Tue, Mar 24, 2015 at 12:13 AM, Christian Balzer ch...@gol.com wrote:
 On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote:

 Yes I read it and do not understand what you mean when you say *verify
 this*? All 3335808 inodes are definitely files and directories created by
 the ceph OSD process:

 What I mean is: how/why did Ceph create 3+ million files? Where in the tree
 are they actually, and are they evenly distributed in the respective PG
 sub-directories?

 Or to ask it differently, how large is your cluster (how many OSDs,
 objects), in short the output of ceph -s.

 If cache-tiers actually are reserving each object that exists on the
 backing store (even if there isn't data in it yet on the cache tier) and
 your cluster is large enough, it might explain this.

Nope. As you've said, this doesn't make any sense unless the objects
are all ludicrously small (and you can't actually get 10-byte objects
in Ceph; the names alone tend to be bigger than that) or something
else is using up inodes.


 And that should both be mentioned and precautions to not run out of inodes
 should be made by the Ceph code.

 If not, this may be a bug after all.

 Would be nice if somebody from the Ceph devs could have gander at this.

 Christian

 *tune2fs 1.42.5 (29-Jul-2012)*
 Filesystem volume name:   none
 Last mounted on:  /var/lib/ceph/tmp/mnt.05NAJ3
 Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
 Filesystem magic number:  0xEF53
 Filesystem revision #:1 (dynamic)
 Filesystem features:  has_journal ext_attr resize_inode dir_index
 filetype extent flex_bg sparse_super large_file huge_file uninit_bg
 dir_nlink extra_isize
 Filesystem flags: signed_directory_hash
 Default mount options:user_xattr acl
 Filesystem state: clean
 Errors behavior:  Continue
 Filesystem OS type:   Linux
 *Inode count:  3335808*
 Block count:  13342945
 Reserved block count: 667147
 Free blocks:  5674105
 *Free inodes:  0*
 First block:  0
 Block size:   4096
 Fragment size:4096
 Reserved GDT blocks:  1020
 Blocks per group: 32768
 Fragments per group:  32768
 Inodes per group: 8176
 Inode blocks per group:   511
 Flex block group size:16
 Filesystem created:   Fri Feb 20 16:44:25 2015
 Last mount time:  Tue Mar 24 09:33:19 2015
 Last write time:  Tue Mar 24 09:33:27 2015
 Mount count:  7
 Maximum mount count:  -1
 Last checked: Fri Feb 20 16:44:25 2015
 Check interval:   0 (none)
 Lifetime writes:  4116 GB
 Reserved blocks uid:  0 (user root)
 Reserved blocks gid:  0 (group root)
 First inode:  11
 Inode size:   256
 Required extra isize: 28
 Desired extra isize:  28
 Journal inode:8
 Default directory hash:   half_md4
 Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b
 Journal backup:   inode blocks

 *fsck.ext4 /dev/sda1*
 e2fsck 1.42.5 (29-Jul-2012)
 /dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

 23.03.2015 17:09, Christian Balzer пишет:
  On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:
 
  Yes, I understand that.
 
  The initial purpose of my first email was just advice for newcomers.
  My fault was that I selected ext4 as the backend for the SSD disks.
  But I did not foresee that the inode count could reach its limit before
  the free space :)
 
  And maybe there should be some sort of warning not only for free space
  in MiB/GiB/TiB, but also a dedicated warning about free inodes for
  filesystems with static inode allocation like ext4.
  Because if an OSD reaches the inode limit it becomes totally unusable and
  immediately goes down, and from that moment there is no way to start
  it!
 
  While all that is true and should probably be addressed, please re-read
  what I wrote before.
 
  With the 3.3 million inodes used and thus likely as many files (did you
  verify this?) and 4MB objects that would make something in the 12TB
  ballpark area.
 
  Something very very strange and wrong is going on with your cache tier.
 
  Christian
 
  23.03.2015 13:42, Thomas Foster пишет:
  You could fix this by changing your block size when formatting the
  mount-point with the mkfs -b command.  I had this same issue when
  dealing with the filesystem using glusterfs and the solution is to
  either use a filesystem that allocates inodes automatically or change
  the block size when you build the filesystem.  Unfortunately, the
  only way to fix the problem that I have seen is to reformat
 
  On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin
  kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote:
 
   In my case there was cache pool for ec-pool serving RBD-images,
   and object size is 4Mb, and client was an /kernel-rbd /client
   each SSD disk is 60G disk, 2 disk per node,  6 nodes in total =
  12 OSDs in total

Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Gregory Farnum
On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi,
 due to two more hosts (now 7 storage nodes) I want to create an new
 ec-pool and get an strange effect:

 ceph@admin:~$ ceph health detail
 HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
 pgs stuck undersized; 2 pgs undersized

This is the big clue: you have two undersized PGs!

 pg 22.3e5 is stuck unclean since forever, current state
 active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]

2147483647 is the largest number you can represent in a signed 32-bit
integer. There's an output error of some kind which is fixed
elsewhere; this should be -1.

So for whatever reason (in general it's hard on CRUSH trying to select
N entries out of N choices), CRUSH hasn't been able to map an OSD to
this slot for you. You'll want to figure out why that is and fix it.
-Greg
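
One way to investigate is to test the map offline with crushtool; a
rough sketch (the rule number is a placeholder for whatever ruleset
pool 22 actually uses):

  ceph osd getcrushmap -o crushmap
  crushtool -i crushmap --test --rule 1 --num-rep 7 --show-bad-mappings

If it turns out CRUSH is just giving up too early when picking 7 hosts
out of 7, raising the retry tunable is the usual fix (again, a sketch):

  crushtool -i crushmap --set-choose-total-tries 100 -o crushmap.new
  ceph osd setcrushmap -i crushmap.new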

 pg 22.240 is stuck unclean since forever, current state
 active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
 pg 22.3e5 is stuck undersized for 406.614447, current state
 active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
 pg 22.240 is stuck undersized for 406.616563, current state
 active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
 pg 22.3e5 is stuck degraded for 406.614566, current state
 active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
 pg 22.240 is stuck degraded for 406.616679, current state
 active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
 pg 22.3e5 is active+undersized+degraded, acting
 [76,15,82,11,57,29,2147483647]
 pg 22.240 is active+undersized+degraded, acting
 [38,85,17,74,2147483647,10,58]

 But I have only 91 OSDs (84 SATA + 7 SSDs), not 2147483647!
 Where the heck did the 2147483647 come from?

 I do following commands:
 ceph osd erasure-code-profile set 7hostprofile k=5 m=2
 ruleset-failure-domain=host
 ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile

 my version:
 ceph -v
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)


 I found an issue in my crush-map - one SSD was twice in the map:
 host ceph-061-ssd {
 id -16  # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0  # rjenkins1
 }
 root ssd {
 id -13  # do not change unnecessarily
 # weight 0.780
 alg straw
 hash 0  # rjenkins1
 item ceph-01-ssd weight 0.170
 item ceph-02-ssd weight 0.170
 item ceph-03-ssd weight 0.000
 item ceph-04-ssd weight 0.170
 item ceph-05-ssd weight 0.170
 item ceph-06-ssd weight 0.050
 item ceph-07-ssd weight 0.050
 item ceph-061-ssd weight 0.000
 }

 Host ceph-061-ssd don't excist and osd-61 is the SSD from ceph-03-ssd,
 but after fix the crusmap the issue with the osd 2147483647 still excist.

 Any idea how to fix that?

 regards

 Udo

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] error creating image in rbd-erasure-pool

2015-03-25 Thread Gregory Farnum
Yes.

On Wed, Mar 25, 2015 at 4:13 AM, Frédéric Nass
frederic.n...@univ-lorraine.fr wrote:
 Hi Greg,

 Thank you for this clarification. It helps a lot.

 Does this "can't think of any issues" apply to both rbd and pool snapshots?

 Frederic.

 

 On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney molo...@ohsu.edu wrote:

 Hi Loic and Markus,
 By the way, Inktank do not support snapshot of a pool with cache tiering
 :

*
 https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf

 Hi,

 You seem to be talking about pool snapshots rather than RBD snapshots.
 But in the linked document it is not clear that there is a distinction:

 Can I use snapshots with a cache tier?
 Snapshots are not supported in conjunction with cache tiers.

 Can anyone clarify if this is just pool snapshots?

 I think that was just a decision based on the newness and complexity
 of the feature for product purposes. Snapshots against cache tiered
 pools certainly should be fine in Giant/Hammer and we can't think of
 any issues in Firefly off the tops of our heads.
 -Greg
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --

 Cordialement,

 Frédéric Nass.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph -w: Understanding MB data versus MB used

2015-03-25 Thread Gregory Farnum
On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote:
 Hello there,

 I started to push data into my ceph cluster. There is something I
 cannot understand in the output of ceph -w.

 When I run ceph -w I get this kinkd of output:

 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056
 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail


 2379MB is actually the data I pushed into the cluster, I can see it
 also in the ceph df output, and the numbers are consistent.

 What I dont understand is 19788MB used. All my pools have size 3, so I
 expected something like 2379 * 3. Instead this number is very big.

 I really need to understand how MB used grows because I need to know
 how many disks to buy.

MB used is the summation of (the programmatic equivalent to) df
across all your nodes, whereas MB data is calculated by the OSDs
based on data they've written down. Depending on your configuration
MB used can include things like the OSD journals, or even totally
unrelated data if the disks are shared with other applications.

MB used including the space used by the OSD journals is my first
guess about what you're seeing here, in which case you'll notice that
it won't grow any faster than MB data does once the journal is fully
allocated.
-Greg
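
A back-of-the-envelope check, with the journal size assumed rather than
known: used is roughly data x replicas + journals + filesystem
overhead, so 2379 MB x 3 is about 7 GB of replicated objects, and a
handful of OSDs each with the default 5 GB journal pre-allocated on the
same disks would comfortably account for the remaining ~12 GB.
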
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)

2015-03-25 Thread Gregory Farnum
On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote:
 Dear All,

 Please forgive this post if it's naive, I'm trying to familiarise myself
 with cephfs!

 I'm using Scientific Linux 6.6. with Ceph 0.87.1

 My first steps with cephfs using a replicated pool worked OK.

 Now trying to test cephfs via a replicated caching tier on top of an
 erasure pool. I've created an erasure pool, but cannot put it under the
 existing replicated pool.

 My thoughts were to delete the existing cephfs, and start again, however I
 cannot delete the existing cephfs:

 errors are as follows:

 [root@ceph1 ~]# ceph fs rm cephfs2
 Error EINVAL: all MDS daemons must be inactive before removing filesystem

 I've tried killing the ceph-mds process, but this does not prevent the above
 error.

 I've also tried this, which also errors:

 [root@ceph1 ~]# ceph mds stop 0
 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate

Right, so did you run ceph mds set_max_mds 0 and then repeating the
stop command? :)


 This also fails...

 [root@ceph1 ~]# ceph-deploy mds destroy
 [ceph_deploy.conf][DEBUG ] found configuration file at:
 /root/.cephdeploy.conf
 [ceph_deploy.cli][INFO  ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy
 [ceph_deploy.mds][ERROR ] subcommand destroy not implemented

 Am I doing the right thing in trying to wipe the original cephfs config
 before attempting to use an erasure cold tier? Or can I just redefine the
 cephfs?

Yeah, unfortunately you need to recreate it if you want to try and use
an EC pool with cache tiering, because CephFS knows what pools it
expects data to belong to. Things are unlikely to behave correctly if
you try and stick an EC pool under an existing one. :(

Sounds like this is all just testing, which is good because the
suitability of EC+cache is very dependent on how much hot data you
have, etc...good luck!
-Greg
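
A rough sketch of the whole sequence, in case it helps (pool names are
placeholders, and it's worth double-checking the exact syntax against
your 0.87 build before relying on it):

  ceph mds set_max_mds 0
  ceph mds stop 0
  ceph fs rm cephfs2 --yes-i-really-mean-it
  ceph osd tier add ecpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool
  ceph fs new cephfs2 cephfs_metadata ecpool   # metadata pool stays replicated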


 many thanks,

 Jake Grimmett
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Slow writes with 1MB files

2015-03-30 Thread Gregory Farnum
On Sat, Mar 28, 2015 at 10:12 AM, Barclay Jameson
almightybe...@gmail.com wrote:
 I redid my entire Ceph build going back to CentOS 7 hoping to
 get the same performance I did last time.
 The rados bench test was the best I have ever had with a time of 740
 MB wr and 1300 MB rd. This was even better than the first rados bench
 test that had performance equal to PanFS. I find that this does not
 translate to my CephFS. Even with the following tweaking it still at
 least twice as slow as PanFS and my first *Magical* build (that had
 absolutely no tweaking):

 OSD
  osd_op_treads 8
  /sys/block/sd*/queue/nr_requests 4096
  /sys/block/sd*/queue/read_ahead_kb 4096

 Client
  rsize=16777216
  readdir_max_bytes=16777216
  readdir_max_entries=16777216

 ~160 mins to copy 100,000 (1MB) files for CephFS vs ~50 mins for PanFS.
 Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.

 Strange thing is none of the resources are taxed.
 CPU, ram, network, disks, are not even close to being taxed on either
 the client,mon/mds, or the osd nodes.
 The PanFS client node was a 10Gb network the same as the CephFS client
 but you can see the huge difference in speed.

 As per Gregs questions before:
 There is only one client reading and writing (time cp Small1/*
 Small2/.) but three clients have cephfs mounted, although they aren't
 doing anything on the filesystem.

 I have done another test where I stream data info a file as fast as
 the processor can put it there.
 (for (i=0; i < 11; i++){ fprintf (out_file, "I is : %d\n", i);}
 ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
 above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
 seconds for CephFS although the first build did it in 130 seconds
 without any tuning.

 This leads me to believe the bottleneck is the mds. Does anybody have
 any thoughts on this?
 Are there any tuning parameters that I would need to speed up the mds?

This is pretty likely, but 10 creates/second is just impossibly slow.
The only other thing I can think of is that you might have enabled
fragmentation but aren't now, which might make an impact on a
directory with 100k entries.

Or else your hardware is just totally wonky, which we've seen in the
past but your server doesn't look quite large enough to be hitting any
of the nasty NUMA stuff...but that's something else to look at which I
can't help you with, although maybe somebody else can.

If you're interested in diving into it and depending on the Ceph
version you're running you can also examine the mds perfcounters
(http://ceph.com/docs/master/dev/perf_counters/) and the op history
(dump_ops_in_flight etc) and look for any operations which are
noticeably slow.
-Greg
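
For reference, a minimal sketch of poking at those counters through the MDS admin socket (the daemon id "a" and the socket path are illustrative; on recent builds "ceph daemon mds.<id> ..." is shorthand for the --admin-daemon form):

    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok perf dump
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok dump_ops_in_flight
    # where the version supports it, the recent-op history as well:
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok dump_historic_ops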


 On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Yes it's the exact same hardware except for the MDS server (although I
 tried using the MDS on the old node).
 I have not tried moving the MON back to the old node.

 My default cache size is mds cache size = 1000
 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
 I created 2048 for data and metadata:
 ceph osd pool create cephfs_data 2048 2048
 ceph osd pool create cephfs_metadata 2048 2048


 To your point on clients competing against each other... how would I check 
 that?

 Do you have multiple clients mounted? Are they both accessing files in
 the directory(ies) you're testing? Were they accessing the same
 pattern of files for the old cluster?

 If you happen to be running a hammer rc or something pretty new you
 can use the MDS admin socket to explore a bit what client sessions
 there are and what they have permissions on and check; otherwise
 you'll have to figure it out from the client side.
 -Greg


 Thanks for the input!


 On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote:
 So this is exactly the same test you ran previously, but now it's on
 faster hardware and the test is slower?

 Do you have more data in the test cluster? One obvious possibility is
 that previously you were working entirely in the MDS' cache, but now
 you've got more dentries and so it's kicking data out to RADOS and
 then reading it back in.

 If you've got the memory (you appear to) you can pump up the mds
 cache size config option quite dramatically from its default of 100,000.

 Other things to check are that you've got an appropriately-sized
 metadata pool, that you've not got clients competing against each
 other inappropriately, etc.
 -Greg

 On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Oops, I should have said that I am not just writing the data but copying 
 it :

 time cp Small1/* Small2/*

 Thanks,

 BJ

 On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 I did a Ceph cluster install 2 weeks ago where I was getting great
 performance (~= PanFS) where I could write 100,000 1MB files

Re: [ceph-users] SSD Journaling

2015-03-30 Thread Gregory Farnum
On Mon, Mar 30, 2015 at 1:01 PM, Garg, Pankaj
pankaj.g...@caviumnetworks.com wrote:
 Hi,

 I’m benchmarking my small cluster with HDDs vs HDDs with SSD Journaling. I
 am using both RADOS bench and Block device (using fio) for testing.

 I am seeing significant Write performance improvements, as expected. I am
 however seeing the Reads coming out a bit slower on the SSD Journaling side.
 They are not terribly different, but sometimes 10% slower.

 Is that something other folks have also seen, or do I need some settings to
 be tuned properly? I’m wondering if accessing 2 drives for reads, adds
 latency and hence the throughput suffers.

You're not reading off of the journal in any case (it's only read on restart).

If I were to guess then the SSD journaling is just building up enough
dirty data ahead of the backing filesystem that if you do a read it
takes a little longer for the data to be readable through the local
filesystem. There have been a number of threads here about configuring
the journal which you might want to grab out of an archiving system
and look at. :)
-Greg
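
For anyone digging through those archives, these are the sort of journal/flush knobs those threads usually discuss; the values below are purely illustrative and should be tested, not copied:

    [osd]
        filestore min sync interval = 1
        filestore max sync interval = 10
        journal max write bytes = 104857600
        journal max write entries = 10000
        journal queue max bytes = 104857600
        journal queue max ops = 3000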




 Thanks

 Pankaj


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to change the MDS node after its been created

2015-03-30 Thread Gregory Farnum
On Mon, Mar 30, 2015 at 3:15 PM, Francois Lafont flafdiv...@free.fr wrote:
 Hi,

 Gregory Farnum wrote:

 The MDS doesn't have any data tied to the machine you're running it
 on. You can either create an entirely new one on a different machine,
 or simply copy the config file and cephx keyring to the appropriate
 directories. :)

 Sorry to jump into this thread, but how can we *remove* an mds daemon from a
 ceph cluster?

 Are the commands below enough?

 stop the daemon
 rm -r /var/lib/ceph/mds/ceph-$id/
 ceph auth del mds.$id

 Should we edit something in the mds map to remove the mds once and
 for all?

As long as you turn on another MDS which takes over the logical rank
of the MDS you remove, you don't need to remove anything from the
cluster store.

Note that if you just copy the directory and keyring to the new
location you shouldn't do the ceph auth del bit either. ;)
-Greg


 --
 François Lafont

 --
 François Lafont
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to change the MDS node after its been created

2015-03-30 Thread Gregory Farnum
On Mon, Mar 30, 2015 at 1:51 PM, Steve Hindle mech...@gmail.com wrote:

 Hi!

   I mistakenly created my MDS node on the 'wrong' server a few months back.
 Now I realized I placed it on a machine lacking IPMI and would like to move
 it to another node in my cluster.

   Is it possible to non-destructively move an MDS ?

The MDS doesn't have any data tied to the machine you're running it
on. You can either create an entirely new one on a different machine,
or simply copy the config file and cephx keyring to the appropriate
directories. :)
-Greg
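
A minimal sketch of the "create an entirely new one on a different machine" route, assuming the standard sysvinit layout; the daemon id, paths and init command below are illustrative:

    # on the new host
    mkdir -p /var/lib/ceph/mds/ceph-newhost
    ceph auth get-or-create mds.newhost mon 'allow profile mds' osd 'allow rwx' mds 'allow' \
        -o /var/lib/ceph/mds/ceph-newhost/keyring
    # copy /etc/ceph/ceph.conf over, then start the daemon
    service ceph start mds.newhost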
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One host failure bring down the whole cluster

2015-03-30 Thread Gregory Farnum
On Mon, Mar 30, 2015 at 8:02 PM, Lindsay Mathieson
lindsay.mathie...@gmail.com wrote:
 On Tue, 31 Mar 2015 02:42:27 AM Kai KH Huang wrote:
 Hi, all
 I have a two-node Ceph cluster, and both are monitor and osd. When
 they're both up, osd are all up and in, everything is fine... almost:



 Two things.

 1 -  You *really* need a min of three monitors. Ceph cannot form a quorum with
 just two monitors and you run a risk of split brain.

You can form quorums with an even number of monitors, and Ceph does so
— there's no risk of split brain.

The problem with 2 monitors is that a quorum is always 2 — which is
exactly what you're seeing right now. You can't run with only one
monitor up (assuming you have a non-zero number of them).

 2 - You also probably have a min size of two set (the default). This means
 that you need a minimum  of two copies of each data object for writes to work.
 So with just two nodes, if one goes down you can't write to the other.

Also this.



 So:
 - Install an extra monitor node - it doesn't have to be powerful, we just use a
 Intel Celeron NUC for that.

 - reduce your minimum size to 1 (One).

Yep.
-Greg
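
In concrete terms, for an existing pool (the default "rbd" pool is used here purely as an example) that would look something like:

    ceph osd pool set rbd size 2        # two copies, matching the two nodes
    ceph osd pool set rbd min_size 1    # keep accepting I/O with one copy left
    ceph osd pool get rbd min_size      # verify

The third monitor still needs to be added separately for the quorum issue.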
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Gregory Farnum
I don't know why you're mucking about manually with the rbd directory;
the rbd tool and rados handle cache pools correctly as far as I know.
-Greg

On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi Greg,
 ok!

 It looks like my problem is more setomapval-related...

 I must do something like
 rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51

 but rados setomapval doesn't take the hex values - instead I got
 rados -p ssd-archiv listomapvals rbd_directory
 name_vm-409-disk-2
 value: (35 bytes) :
 0000 : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
 0020 : 63 35 31: c51


 hmm, strange. With  rados -p ssd-archiv getomapval rbd_directory 
 name_vm-409-disk-2 name_vm-409-disk-2
 I got the binary value written into the file name_vm-409-disk-2, but the reverse,
 rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
 name_vm-409-disk-2
 fills the value with the literal string name_vm-409-disk-2 and not with the content
 of the file...

 Are there other tools for the rbd_directory?

 regards

 Udo

 Am 26.03.2015 15:03, schrieb Gregory Farnum:
 You shouldn't rely on rados ls when working with cache pools. It
 doesn't behave properly and is a silly operation to run against a pool
 of any size even when it does. :)

 More specifically, rados ls is invoking the pgls operation. Normal
 read/write ops will go query the backing store for objects if they're
 not in the cache tier. pgls is different — it just tells you what
 objects are present in the PG on that OSD right now. So any objects
 which aren't in cache won't show up when listing on the cache pool.
 -Greg
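
To illustrate the difference described above with the poster's own object names (purely as an example): a pool listing only reflects what currently sits in the cache, while a normal object read is forwarded to the backing pool and may promote the object into the cache:

    rados -p ssd-archiv ls                                       # may look empty
    rados -p ssd-archiv stat rbd_data.2e47de674b0dc51.00390074   # read path still finds it
    rados -p ssd-archiv get rbd_data.2e47de674b0dc51.00390074 /tmp/obj
    rados -p ecarchiv ls | head                                  # base pool still lists everything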

 On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi all,
 due to a very silly approach, I removed the cache tier of a filled EC pool.

 After recreating the pool and connecting it with the EC pool I don't see any 
 content.
 How can I see the rbd_data and other files through the new ssd cache tier?

 I think that I must recreate the rbd_directory (and fill it with setomapval), 
 but I don't see anything yet!

 $ rados ls -p ecarchiv | more
 rbd_data.2e47de674b0dc51.00390074
 rbd_data.2e47de674b0dc51.0020b64f
 rbd_data.2fbb1952ae8944a.0016184c
 rbd_data.2cfc7ce74b0dc51.00363527
 rbd_data.2cfc7ce74b0dc51.0004c35f
 rbd_data.2fbb1952ae8944a.0008db43
 rbd_data.2cfc7ce74b0dc51.0015895a
 rbd_data.31229f0238e1f29.000135eb
 ...

 $ rados ls -p ssd-archiv
  nothing 

 generation of the cache tier:
 $ rados mkpool ssd-archiv
 $ ceph osd pool set ssd-archiv crush_ruleset 5
 $ ceph osd tier add ecarchiv ssd-archiv
 $ ceph osd tier cache-mode ssd-archiv writeback
 $ ceph osd pool set ssd-archiv hit_set_type bloom
 $ ceph osd pool set ssd-archiv hit_set_count 1
 $ ceph osd pool set ssd-archiv hit_set_period 3600
 $ ceph osd pool set ssd-archiv target_max_bytes 500


 rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
 }


 Is there any magic (or which command did I miss?) to see the existing 
 data through the cache tier?


 regards - and hoping for answers

 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Gregory Farnum
On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last
 friday I got everything deployed and all was working well, and I set noout
 and shut all the OSD nodes down over the weekend. Yesterday when I spun it
 back up, the OSDs were behaving very strangely, incorrectly marking each
 other down because of missed heartbeats, even though they were up. It looked like
 some kind of low-level networking problem, but I couldn't find one.

 After much work, I narrowed the apparent source of the problem down to the
 OSDs running on the first host I started in the morning. They were the ones
 that logged the most messages about not being able to ping other OSDs,
 and the other OSDs were mostly complaining about them. After running out of
 other ideas to try, I restarted them, and then everything started working.
 It's still working happily this morning. It seems as though when that set of
 OSDs started they got stale OSD map information from the MON boxes, which
 failed to be updated as the other OSDs came up. Does that make sense? I
 still don't consider myself an expert on ceph architecture and would
 appreciate and corrections or other possible interpretations of events (I'm
 happy to provide whatever additional information I can) so I can get a
 deeper understanding of things. If my interpretation of events is correct,
 it seems that might point at a bug.

I can't find the ticket now, but I think we did indeed have a bug
around heartbeat failures when restarting nodes. This has been fixed
in other branches but might have been missed for giant. (Did you by
any chance set the nodown flag as well as noout?)

In general Ceph isn't very happy with being shut down completely like
that and its behaviors aren't validated, so nothing will go seriously
wrong but you might find little irritants like this. It's particularly
likely when you're prohibiting state changes with the noout/nodown
flags.
-Greg
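
For reference, the flag handling for a planned whole-cluster shutdown is just the following (illustrative; "ceph osd dump | grep flags" shows what is currently set):

    ceph osd set noout        # before powering nodes off
    # ... shut down, do the maintenance, power back on ...
    ceph osd unset noout      # once the OSDs are back up
    ceph -s                   # confirm health before resuming client traffic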
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One of three monitors can not be started

2015-03-31 Thread Gregory Farnum
On Tue, Mar 31, 2015 at 2:50 AM, 张皓宇 zhanghaoyu1...@hotmail.com wrote:
 Who can help me?

 One monitor in my ceph cluster can not be started.
 Before that, I added '[mon] mon_compact_on_start = true' to
 /etc/ceph/ceph.conf on three monitor hosts. Then I did 'ceph tell
 mon.computer05 compact ' on computer05, which has a monitor on it.
 When store.db of computer05 changed from 108G to 1G,  mon.computer06 stoped,
 and it can not be started since that.

 If I start mon.computer06, it will stop on this state:
 # /etc/init.d/ceph start mon.computer06
 === mon.computer06 ===
 Starting Ceph mon.computer06 on computer06...

 The process info is like this:
 root 12149 3807 0 20:46 pts/27 00:00:00 /bin/sh /etc/init.d/ceph start
 mon.computer06
 root 12308 12149 0 20:46 pts/27 00:00:00 bash -c ulimit -n 32768;
 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid
 -c /etc/ceph/ceph.conf
 root 12309 12308 0 20:46 pts/27 00:00:00 /usr/bin/ceph-mon -i computer06
 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
 root 12313 12309 19 20:46 pts/27 00:00:01 /usr/bin/ceph-mon -i computer06
 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf

 Log on computer06 is like this:
 2015-03-30 20:46:54.152956 7fc5379d07a0  0 ceph version 0.72.2
 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309
 ...
 2015-03-30 20:46:54.759791 7fc5379d07a0  1 mon.computer06@-1(probing) e4
 preinit clean up potentially inconsistent store state

So I haven't looked at this code in a while, but I think the monitor
is trying to validate that it's consistent with the others. You
probably want to dig around the monitor admin sockets and see what
state each monitor is in, plus its perception of the others.

In this case, I think maybe mon.computer06 is trying to examine its
whole store, but 100GB is a lot (way too much, in fact), so this can
take a long time.
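
A minimal sketch of that digging, using the default admin socket paths (adjust the ids and paths to your deployment; a wedged monitor may not answer on its socket):

    ceph --admin-daemon /var/run/ceph/ceph-mon.computer05.asok mon_status
    ceph --admin-daemon /var/run/ceph/ceph-mon.computer06.asok mon_status
    ceph --admin-daemon /var/run/ceph/ceph-mon.computer05.asok quorum_status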


 Sorry, my English is not good.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Gregory Farnum
On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 Thanks for the extra info Gregory. I did not also set nodown.

 I expect that I will be very rarely shutting everything down in the normal
 course of things, but it has come up a couple times when having to do some
 physical re-organizing of racks. Little irritants like this aren't a big
 deal if people know to expect them, but as it is I lost quite a lot of time
 troubleshooting a non-existant problem. What's the best way to get notes to
 that effect added to the docs? It seems something in
 http://ceph.com/docs/master/rados/operations/operating/ would save some
 people some headache. I'm happy to propose edits, but a quick look doesn't
 reveal a process for submitting that sort of thing.

Github pull requests. :)


 My understanding is that the right method to take an entire cluster
 offline is to set noout and then shutting everything down. Is there a better
 way?

That's probably the best way to do it. Like I said, there was also a
bug here that I think is fixed for Hammer but that might not have been
backported to Giant. Unfortunately I don't remember the right keywords
as I wasn't involved in the fix.
-Greg


 QH

 On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum g...@gregs42.com wrote:

 On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:
  I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
  Last
  friday I got everything deployed and all was working well, and I set
  noout
  and shut all the OSD nodes down over the weekend. Yesterday when I spun
  it
  back up, the OSDs were behaving very strangely, incorrectly marking each
  other because of missed heartbeats, even though they were up. It looked
  like
  some kind of low-level networking problem, but I couldn't find any.
 
  After much work, I narrowed the apparent source of the problem down to
  the
  OSDs running on the first host I started in the morning. They were the
  ones
  that were logged the most messages about not being able to ping other
  OSDs,
  and the other OSDs were mostly complaining about them. After running out
  of
  other ideas to try, I restarted them, and then everything started
  working.
  It's still working happily this morning. It seems as though when that
  set of
  OSDs started they got stale OSD map information from the MON boxes,
  which
  failed to be updated as the other OSDs came up. Does that make sense? I
  still don't consider myself an expert on ceph architecture and would
  appreciate and corrections or other possible interpretations of events
  (I'm
  happy to provide whatever additional information I can) so I can get a
  deeper understanding of things. If my interpretation of events is
  correct,
  it seems that might point at a bug.

 I can't find the ticket now, but I think we did indeed have a bug
 around heartbeat failures when restarting nodes. This has been fixed
 in other branches but might have been missed for giant. (Did you by
 any chance set the nodown flag as well as noout?)

 In general Ceph isn't very happy with being shut down completely like
 that and its behaviors aren't validated, so nothing will go seriously
 wrong but you might find little irritants like this. It's particularly
 likely when you're prohibiting state changes with the noout/nodown
 flags.
 -Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Gregory Farnum
On Mon, Mar 23, 2015 at 6:21 AM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
 Hi,

 I'm still trying to find out why there are many more write operations on
 the filestore since Emperor/Firefly than on Dumpling.

Do you have any history around this? It doesn't sound familiar,
although I bet it's because of the WBThrottle and flushing changes.


 So, I add monitoring of all perf counters values from OSD.

 From what I see : «filestore.ops» reports an average of 78 operations
 per second. But block device monitoring reports an average of 113
 operations per second (+45%).
 Please see those 2 graphs :
 - https://daevel.fr/img/firefly/osd-70.filestore-ops.png
 - https://daevel.fr/img/firefly/osd-70.sda-ops.png

That's unfortunate but perhaps not surprising — any filestore op can
change a backing file (which requires hitting both the file and the
inode: potentially two disk seeks), as well as adding entries to the
leveldb instance.
-Greg
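
As a side note, the counters those graphs were built from can be sampled straight off the OSD admin socket; diffing two samples gives per-interval rates (osd.70 and the socket path below are just the poster's example):

    ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump > sample1.json
    sleep 60
    ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump > sample2.json
    # compare the "filestore" section (ops, bytes, journal counters) between the two samples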


 Do you see what can explain this difference? (this OSD uses XFS)

 Thanks,
 Olivier

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Gregory Farnum
On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto ziopr...@gmail.com wrote:
 Hello,

 I started to work with Ceph a few weeks ago. I might ask a very newbie
 question, but I could not find an answer in the docs or in the ml
 archive for this.

 Quick description of my setup:
 I have a ceph cluster with two servers. Each server has 3 SSD drives I
 use for journals only. To map SAS disks that keep a journal on the same
 SSD drive to different failure domains, I wrote my own crushmap.
 I now have a total of 36 OSDs. Ceph health returns HEALTH_OK.
 I run the cluster with a couple of pools with size=3 and min_size=3


 Production operations questions:
 I manually stopped some OSDs to simulate a failure.

 As far as I understood, an OSD down condition is not enough to make
 CEPH start making new copies of objects. I noticed that I must mark
 the OSD as out to make ceph produce new copies.
 As far as I understood min_size=3 puts the object in readonly if there
 are not at least 3 copies of the object available.

That is correct, but the default min_size with size 3 is 2 and you probably
want to use that instead. If you have size==min_size on firefly
releases and lose an OSD it can't do recovery so that PG is stuck
without manual intervention. :( This is because of some quirks about
how the OSD peering and recovery works, so you'd be forgiven for
thinking it would recover nicely.
(This is changed in the upcoming Hammer release, but you probably
still want to allow cluster activity when an OSD fails, unless you're
very confident in their uptime and more concerned about durability
than availability.)
-Greg


 Is this behavior correct or did I make some mistake creating the cluster ?
 Should I expect ceph to automatically produce a new copy of objects
 when some OSDs are down ?
 Is there any option to automatically mark out OSDs that go down ?

 thanks

 Saverio
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-23 Thread Gregory Farnum
On Mon, Mar 23, 2015 at 4:31 AM, f...@univ-lr.fr f...@univ-lr.fr wrote:
 Hi Somnath,

 Thank you, please find my answers below

 Somnath Roy somnath@sandisk.com a écrit le 22/03/15 18:16 :

 Hi Frederick,

 Need some information here.



 1. Just to clarify, you are saying it is happening in 0.87.1 and not in
 Firefly ?

 That's a possibility, others running similar hardware (and possibly OS, I
 can ask) confirm they don't see such visible behavior on Firefly.
 I'd need to install Firefly on our hosts to be sure.
 We run on RHEL.



 2. Is it happening after some hours of run or just right away ?

 It's happening on freshly installed hosts and goes on.



 3. Please provide ‘perf top’ output of all the OSD nodes.

 Here they are :
 http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html
 http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html

 The left-hand 'high-cpu' nodes have tcmalloc calls that could explain the cpu
 difference. We don't see them on 'low-cpu' nodes :

 12,15%  libtcmalloc.so.4.1.2  [.]
 tcmalloc::CentralFreeList::FetchFromSpans

Huh. The tcmalloc (memory allocator) workload should be roughly the
same across all nodes, especially if they have equivalent
distributions of PGs and primariness as you describe. Are you sure
this is a persistent CPU imbalance or are they oscillating? Are there
other processes on some of the nodes which could be requiring memory
from the system?

Either you've found a new bug in our memory allocator or something
else is going on in the system to make it behave differently across
your nodes.
-Greg
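
One cheap check here, assuming the default tcmalloc build, is to ask the suspect OSDs for their allocator statistics (osd.12 is just an example id):

    ceph tell osd.12 heap stats      # tcmalloc heap / free-list statistics
    ceph tell osd.12 heap release    # hand unused memory back to the OS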
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't Start OSD

2015-03-23 Thread Gregory Farnum
On Sun, Mar 22, 2015 at 11:22 AM, Somnath Roy somnath@sandisk.com wrote:
 You should have replicated copies on other OSDs (disks), so no need to 
 worry about data loss. Add a new drive and follow the steps in the 
 following links (either 1 or 2)

Except that's not the case if you only had one copy of the PG, as
seems to be indicated by the last acting[1] output all over that
health warning. :/
You certainly should have a copy of the data elsewhere, but that
message means you *didn't*; presumably you had 2 copies of everything
and either your CRUSH map was bad (which should have provoked lots of
warnings?) or you've lost more than one OSD.
-Greg
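
To see how exposed you are, it is worth enumerating the stale PGs and where they map; PG 7.5b7 below is taken from the health output quoted in this thread, and note that "pg query" may not return if no OSD holding the PG is up:

    ceph health detail | grep stale | head
    ceph pg dump_stuck stale
    ceph pg map 7.5b7
    ceph pg 7.5b7 query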


 1. For manual deployment, 
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

 2. With ceph-deploy, 
 http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/

 After successful deployment, rebalancing should start and eventually cluster 
 will come to healthy state.

 Thanks  Regards
 Somnath


 -Original Message-
 From: Noah Mehl [mailto:noahm...@combinedpublic.com]
 Sent: Sunday, March 22, 2015 11:15 AM
 To: Somnath Roy
 Cc: ceph-users@lists.ceph.com
 Subject: Re: Can't Start OSD

 Somnath,

 You are correct, there are dmesg errors about the drive.  How can I replace 
 the drive?  Can I copy all of the readable contents from this drive to a new 
 one?  Because I have the following output from “ceph health detail”

 HEALTH_WARN 43 pgs stale; 43 pgs stuck stale pg 7.5b7 is stuck stale for 
 5954121.993990, current state stale+active+clean, last acting [1] pg 7.42a is 
 stuck stale for 5954121.993885, current state stale+active+clean, last acting 
 [1] pg 7.669 is stuck stale for 5954121.994072, current state 
 stale+active+clean, last acting [1] pg 7.121 is stuck stale for 
 5954121.993586, current state stale+active+clean, last acting [1] pg 7.4ec is 
 stuck stale for 5954121.993956, current state stale+active+clean, last acting 
 [1] pg 7.1e4 is stuck stale for 5954121.993670, current state 
 stale+active+clean, last acting [1] pg 7.41f is stuck stale for 
 5954121.993901, current state stale+active+clean, last acting [1] pg 7.59f is 
 stuck stale for 5954121.994024, current state stale+active+clean, last acting 
 [1] pg 7.39 is stuck stale for 5954121.993490, current state 
 stale+active+clean, last acting [1] pg 7.584 is stuck stale for 
 5954121.994026, current state stale+active+clean, last acting [1] pg 7.fd is 
 stuck stale for 5954121.993600, current state stale+active+clean, last acting 
 [1] pg 7.6fd is stuck stale for 5954121.994158, current state 
 stale+active+clean, last acting [1] pg 7.4b5 is stuck stale for 
 5954121.993975, current state stale+active+clean, last acting [1] pg 7.328 is 
 stuck stale for 5954121.993840, current state stale+active+clean, last acting 
 [1] pg 7.4a9 is stuck stale for 5954121.993981, current state 
 stale+active+clean, last acting [1] pg 7.569 is stuck stale for 
 5954121.994046, current state stale+active+clean, last acting [1] pg 7.629 is 
 stuck stale for 5954121.994119, current state stale+active+clean, last acting 
 [1] pg 7.623 is stuck stale for 5954121.994118, current state 
 stale+active+clean, last acting [1] pg 7.6dd is stuck stale for 
 5954121.994179, current state stale+active+clean, last acting [1] pg 7.3d5 is 
 stuck stale for 5954121.993935, current state stale+active+clean, last acting 
 [1] pg 7.54b is stuck stale for 5954121.994058, current state 
 stale+active+clean, last acting [1] pg 7.3cf is stuck stale for 
 5954121.993938, current state stale+active+clean, last acting [1] pg 7.c4 is 
 stuck stale for 5954121.993633, current state stale+active+clean, last acting 
 [1] pg 7.178 is stuck stale for 5954121.993719, current state 
 stale+active+clean, last acting [1] pg 7.3b8 is stuck stale for 
 5954121.993946, current state stale+active+clean, last acting [1] pg 7.b1 is 
 stuck stale for 5954121.993635, current state stale+active+clean, last acting 
 [1] pg 7.5fb is stuck stale for 5954121.994146, current state 
 stale+active+clean, last acting [1] pg 7.236 is stuck stale for 
 5954121.993801, current state stale+active+clean, last acting [1] pg 7.2f5 is 
 stuck stale for 5954121.993881, current state stale+active+clean, last acting 
 [1] pg 7.ac is stuck stale for 5954121.993643, current state 
 stale+active+clean, last acting [1] pg 7.16d is stuck stale for 
 5954121.993738, current state stale+active+clean, last acting [1] pg 7.6b7 is 
 stuck stale for 5954121.994223, current state stale+active+clean, last acting 
 [1] pg 7.5ea is stuck stale for 5954121.994166, current state 
 stale+active+clean, last acting [1] pg 7.a3 is stuck stale for 
 5954121.993654, current state stale+active+clean, last acting [1] pg 7.52d is 
 stuck stale for 5954121.994110, current state stale+active+clean, last acting 
 [1] pg 7.2d8 is stuck stale for 5954121.993904, current state 
 stale+active+clean, last acting [1] pg 7.2db is stuck stale for 
 5954121.993903, 

Re: [ceph-users] How does crush selects different osds using hash(pg) in diferent iterations

2015-03-23 Thread Gregory Farnum
On Sat, Mar 21, 2015 at 10:46 AM, shylesh kumar shylesh.mo...@gmail.com wrote:
 Hi ,

 I was going through this simplified crush algorithm given in ceph website.

 def crush(pg):
     all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
     result = []
     # size is the number of copies; primary+replicas
     while len(result) < size:
 -->     r = hash(pg)
         chosen = all_osds[ r % len(all_osds) ]
         if chosen in result:
             # OSD can be picked only once
             continue
         result.append(chosen)
     return result

 In the line where r = hash(pg), will it give the same hash value in every
 iteration ?
 If that is the case we always end up choosing the same osd from the list.
 Or will the pg number be used as a seed for the hashing so that the r value
 changes in the next iteration?

 Am I missing something really basic ??
 Can somebody please provide me some pointers ?

I'm not sure where this bit of documentation came from, but the
selection process includes the attempt number as one of the inputs,
where the attempt starts at 0 (or 1, I dunno) and increments each time
we try to map a new OSD to the PG.
-Greg
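
In terms of the simplified pseudocode quoted above, the idea is roughly the following sketch (illustrative only, not the real CRUSH implementation):

    def crush(pg, all_osds, size):
        result = []
        attempt = 0
        while len(result) < size:
            r = hash((pg, attempt))            # pg AND the attempt number feed the hash
            chosen = all_osds[r % len(all_osds)]
            attempt += 1
            if chosen in result:
                continue                       # OSD can be picked only once; retry with a new r
            result.append(chosen)
        return result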




 --
 Thanks,
 Shylesh Kumar M


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Gregory Farnum
On Mon, Mar 23, 2015 at 7:17 AM, Saverio Proto ziopr...@gmail.com wrote:
 Hello,

 thanks for the answers.

 This was exacly what I was looking for:

 mon_osd_down_out_interval = 900

 I was not waiting long enough to see my cluster recover by itself.
 That's why I tried to increase min_size, because I did not understand
 what min_size was for.

 Now that I know what min_size is, I guess the best setting for me is
 min_size = 1 because I would like to be able to make I/O operations
 even if only 1 copy is left.

I'd strongly recommend leaving it at two — if you reduce it to 1 then
you can lose data by having just one disk die at an inopportune
moment, whereas if you leave it at 2 the system won't accept any
writes to only one hard drive. Leaving it at two the system will still
try and re-replicate back up to three copies after mon osd down out
interval time has elapsed from a failure. :)
-Greg
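
For reference, the automatic marking-out behaviour discussed above is controlled by a single option; a sketch of setting it follows (the 900-second value is the poster's, the monitor id is illustrative):

    [mon]
        mon osd down out interval = 900

    # or at runtime on a running monitor:
    ceph tell mon.a injectargs '--mon-osd-down-out-interval 900'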


 Thanks to all for helping !

 Saverio



 2015-03-23 14:58 GMT+01:00 Gregory Farnum g...@gregs42.com:
 On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto ziopr...@gmail.com wrote:
 Hello,

 I started to work with CEPH few weeks ago, I might ask a very newbie
 question, but I could not find an answer in the docs or in the ml
 archive for this.

 Quick description of my setup:
 I have a ceph cluster with two servers. Each server has 3 SSD drives I
 use for journal only. To map to different failure domains SAS disks
 that keep a journal to the same SSD drive, I wrote my own crushmap.
 I have now a total of 36OSD. Ceph health returns HEALTH_OK.
 I run the cluster with a couple of pools with size=3 and min_size=3


 Production operations questions:
 I manually stopped some OSDs to simulate a failure.

 As far as I understood, an OSD down condition is not enough to make
 CEPH start making new copies of objects. I noticed that I must mark
 the OSD as out to make ceph produce new copies.
 As far as I understood min_size=3 puts the object in readonly if there
 are not at least 3 copies of the object available.

 That is correct, but the default with size 3 is 2 and you probably
 want to do that instead. If you have size==min_size on firefly
 releases and lose an OSD it can't do recovery so that PG is stuck
 without manual intervention. :( This is because of some quirks about
 how the OSD peering and recovery works, so you'd be forgiven for
 thinking it would recover nicely.
 (This is changed in the upcoming Hammer release, but you probably
 still want to allow cluster activity when an OSD fails, unless you're
 very confident in their uptime and more concerned about durability
 than availability.)
 -Greg


 Is this behavior correct or I made some mistake creating the cluster ?
 Should I expect ceph to produce automatically a new copy for objects
 when some OSDs are down ?
 There is any option to mark automatically out OSDs that go down ?

 thanks

 Saverio
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph binary missing from ceph-0.87.1-0.el6.x86_64

2015-03-02 Thread Gregory Farnum
The ceph tool got moved into ceph-common at some point, so it
shouldn't be in the ceph rpm. I'm not sure what step in the
installation process should have handled that, but I imagine it's your
problem.
-Greg
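
Until ceph-deploy handles it, a quick manual workaround sketch on the affected node (package and path names as shipped in the 0.87.x el6 repos, per the report below):

    yum -y install ceph-common
    rpm -ql ceph-common | grep 'bin/ceph$'   # should list /usr/bin/ceph
    ceph --version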

On Mon, Mar 2, 2015 at 11:24 AM, Michael Kuriger mk7...@yp.com wrote:
 Hi all,
 When doing a fresh install on a new cluster, and using the latest rpm
 (0.87.1) ceph-deploy fails right away.  I checked the files inside the rpm,
 and /usr/bin/ceph is not there.  Upgrading from the previous rpm seems to
 work, but ceph-deploy is pulling the latest rpm automatically.


 [ceph201][DEBUG ] connected to host: ceph201

 [ceph201][DEBUG ] detect platform information from remote host

 [ceph201][DEBUG ] detect machine type

 [ceph_deploy.install][INFO  ] Distro info: CentOS 6.5 Final

 [ceph201][INFO  ] installing ceph on ceph201

 [ceph201][INFO  ] Running command: yum clean all

 [ceph201][DEBUG ] Loaded plugins: fastestmirror, security

 [ceph201][DEBUG ] Cleaning repos: base updates-released ceph-released

 [ceph201][DEBUG ] Cleaning up Everything

 [ceph201][DEBUG ] Cleaning up list of fastest mirrors

 [ceph201][INFO  ] Running command: rpm --import
 https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc

 [ceph201][INFO  ] Running command: rpm -Uvh --replacepkgs
 http://ceph.com/rpm-firefly/el6/noarch/ceph-release-1-0.el6.noarch.rpm

 [ceph201][DEBUG ] Retrieving
 http://ceph.com/rpm-firefly/el6/noarch/ceph-release-1-0.el6.noarch.rpm

 [ceph201][DEBUG ] Preparing...
 ##

 [ceph201][DEBUG ] ceph-release
 ##

 [ceph201][WARNIN] ensuring that /etc/yum.repos.d/ceph.repo contains a high
 priority

 [ceph201][WARNIN] altered ceph.repo priorities to contain: priority=1

 [ceph201][INFO  ] Running command: yum -y install ceph

 [ceph201][DEBUG ] Loaded plugins: fastestmirror, security

 [ceph201][DEBUG ] Determining fastest mirrors

 [ceph201][DEBUG ] Setting up Install Process

 [ceph201][DEBUG ] Resolving Dependencies

 [ceph201][DEBUG ] -- Running transaction check

 [ceph201][DEBUG ] --- Package ceph.x86_64 1:0.87.1-0.el6 will be installed

 [ceph201][DEBUG ] -- Finished Dependency Resolution

 [ceph201][DEBUG ]

 [ceph201][DEBUG ] Dependencies Resolved

 [ceph201][DEBUG ]

 [ceph201][DEBUG ]
 

 [ceph201][DEBUG ]  Package  Arch   Version
 RepositorySize

 [ceph201][DEBUG ]
 

 [ceph201][DEBUG ] Installing:

 [ceph201][DEBUG ]  ceph x86_64 1:0.87.1-0.el6
 ceph-released  13 M

 [ceph201][DEBUG ]

 [ceph201][DEBUG ] Transaction Summary

 [ceph201][DEBUG ]
 

 [ceph201][DEBUG ] Install   1 Package(s)

 [ceph201][DEBUG ]

 [ceph201][DEBUG ] Total download size: 13 M

 [ceph201][DEBUG ] Installed size: 50 M

 [ceph201][DEBUG ] Downloading Packages:

 [ceph201][DEBUG ] Running rpm_check_debug

 [ceph201][DEBUG ] Running Transaction Test

 [ceph201][DEBUG ] Transaction Test Succeeded

 [ceph201][DEBUG ] Running Transaction

   Installing : 1:ceph-0.87.1-0.el6.x86_64
 1/1

   Verifying  : 1:ceph-0.87.1-0.el6.x86_64
 1/1

 [ceph201][DEBUG ]

 [ceph201][DEBUG ] Installed:

 [ceph201][DEBUG ]   ceph.x86_64 1:0.87.1-0.el6

 [ceph201][DEBUG ]

 [ceph201][DEBUG ] Complete!

 [ceph201][INFO  ] Running command: ceph --version

 [ceph201][ERROR ] Traceback (most recent call last):

 [ceph201][ERROR ]   File
 /usr/lib/python2.6/site-packages/ceph_deploy/lib/vendor/remoto/process.py,
 line 87, in run

 [ceph201][ERROR ] reporting(conn, result, timeout)

 [ceph201][ERROR ]   File
 /usr/lib/python2.6/site-packages/ceph_deploy/lib/vendor/remoto/log.py,
 line 13, in reporting

 [ceph201][ERROR ] received = result.receive(timeout)

 [ceph201][ERROR ]   File
 /usr/lib/python2.6/site-packages/ceph_deploy/lib/vendor/remoto/lib/vendor/execnet/gateway_base.py,
 line 704, in receive

 [ceph201][ERROR ] raise self._getremoteerror() or EOFError()

 [ceph201][ERROR ] RemoteError: Traceback (most recent call last):

 [ceph201][ERROR ]   File string, line 1036, in executetask

 [ceph201][ERROR ]   File remote exec, line 11, in _remote_run

 [ceph201][ERROR ]   File /usr/lib64/python2.6/subprocess.py, line 642, in
 __init__

 [ceph201][ERROR ] errread, errwrite)

 [ceph201][ERROR ]   File /usr/lib64/python2.6/subprocess.py, line 1234, in
 _execute_child

 [ceph201][ERROR ] raise child_exception

 [ceph201][ERROR ] OSError: [Errno 2] No such file or directory

 [ceph201][ERROR ]

 [ceph201][ERROR ]



 Michael Kuriger




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___

Re: [ceph-users] CephFS Attributes Question Marks

2015-03-02 Thread Gregory Farnum
I bet it's that permission issue combined with a minor bug in FUSE on
that kernel, or maybe in the ceph-fuse code (but I've not seen it
reported before, so I kind of doubt it). If you run ceph-fuse with
debug client = 20 it will output (a whole lot of) logging to the
client's log file and you could see what requests are getting
processed by the Ceph code and how it's responding. That might let you
narrow things down. It's certainly not any kind of timeout.
-Greg
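
A minimal sketch of that logging setup on the client side via ceph.conf (the log path is illustrative; expect very large files at this level), after which you remount with ceph-fuse and reproduce the ls:

    [client]
        debug client = 20
        log file = /var/log/ceph/ceph-fuse.log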

On Mon, Mar 2, 2015 at 3:57 PM, Scottix scot...@gmail.com wrote:
 3 Ceph servers on Ubuntu 12.04.5 - kernel 3.13.0-29-generic

 We have an old server that we compiled the ceph-fuse client on
 Suse11.4 - kernel 2.6.37.6-0.11
 This is the only mount we have right now.

 We don't have any problems reading the files and the directory shows full
 775 permissions and doing a second ls fixes the problem.

 On Mon, Mar 2, 2015 at 3:51 PM Bill Sanders billysand...@gmail.com wrote:

 Forgive me if this is unhelpful, but could it be something to do with
 permissions of the directory and not Ceph at all?

 http://superuser.com/a/528467

 Bill

 On Mon, Mar 2, 2015 at 3:47 PM, Gregory Farnum g...@gregs42.com wrote:

 On Mon, Mar 2, 2015 at 3:39 PM, Scottix scot...@gmail.com wrote:
  We have a file system running CephFS and for a while we had this issue
  when
  doing an ls -la we get question marks in the response.
 
  -rw-r--r-- 1 wwwrun root14761 Feb  9 16:06
  data.2015-02-08_00-00-00.csv.bz2
  -? ? ?  ?   ??
  data.2015-02-09_00-00-00.csv.bz2
 
  If we do another directory listing it show up fine.
 
  -rw-r--r-- 1 wwwrun root14761 Feb  9 16:06
  data.2015-02-08_00-00-00.csv.bz2
  -rw-r--r-- 1 wwwrun root13675 Feb 10 15:21
  data.2015-02-09_00-00-00.csv.bz2
 
  It hasn't been a problem but just wanted to see if this is an issue,
  could
  the attributes be timing out? We do have a lot of files in the
  filesystem so
  that could be a possible bottleneck.

 Huh, that's not something I've seen before. Are the systems you're
 doing this on the same? What distro and kernel version? Is it reliably
 one of them showing the question marks, or does it jump between
 systems?
 -Greg

 
  We are using the ceph-fuse mount.
  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
  We are planning to do the update soon to 87.1
 
  Thanks
  Scottie
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Attributes Question Marks

2015-03-02 Thread Gregory Farnum
On Mon, Mar 2, 2015 at 3:39 PM, Scottix scot...@gmail.com wrote:
 We have a file system running CephFS and for a while we had this issue when
 doing an ls -la we get question marks in the response.

 -rw-r--r-- 1 wwwrun root14761 Feb  9 16:06
 data.2015-02-08_00-00-00.csv.bz2
 -? ? ?  ?   ??
 data.2015-02-09_00-00-00.csv.bz2

 If we do another directory listing it show up fine.

 -rw-r--r-- 1 wwwrun root14761 Feb  9 16:06
 data.2015-02-08_00-00-00.csv.bz2
 -rw-r--r-- 1 wwwrun root13675 Feb 10 15:21
 data.2015-02-09_00-00-00.csv.bz2

 It hasn't been a problem but just wanted to see if this is an issue, could
 the attributes be timing out? We do have a lot of files in the filesystem so
 that could be a possible bottleneck.

Huh, that's not something I've seen before. Are the systems you're
doing this on the same? What distro and kernel version? Is it reliably
one of them showing the question marks, or does it jump between
systems?
-Greg


 We are using the ceph-fuse mount.
 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 We are planning to do the update soon to 87.1

 Thanks
 Scottie


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Update 0.80.5 to 0.80.8 --the VM's read request become too slow

2015-03-02 Thread Gregory Farnum
On Mon, Mar 2, 2015 at 7:15 PM, Nathan O'Sullivan nat...@mammoth.com.au wrote:

 On 11/02/2015 1:46 PM, 杨万元 wrote:

 Hello!
 We use Ceph+Openstack in our private cloud. Recently we upgraded our
 centos6.5 based cluster from Ceph Emperor to Ceph Firefly.
 At first, we used the redhat epel yum repo to upgrade; that Ceph version is
 0.80.5. We first upgraded the monitors, then the osds, and finally the clients.
 When we completed this upgrade, we booted a VM on the cluster, then used fio to
 test the io performance. The io performance was as good as before. Everything was ok!
 Then we upgraded the cluster from 0.80.5 to 0.80.8. When we completed that,
 we rebooted the VM to load the newest librbd. After that we also used fio to
 test the io performance. We found that randwrite and write are as good as
 before, but randread and read have become worse: randread's iops dropped from
 4000-5000 to 300-400, and the latency is worse. The read bandwidth dropped from
 400MB/s to 115MB/s. Then I downgraded the ceph client version from 0.80.8 to 0.80.5,
 and the result became normal again.
  So I think it is maybe something about librbd.  I compared the 0.80.8
 release notes with 0.80.5
 (http://ceph.com/docs/master/release-notes/#v0-80-8-firefly ), and I just found
 this change in 0.80.8 that is about read requests:  librbd: cap
 memory utilization for read requests (Jason Dillaman)  .  Who can explain
 this?


 FWIW we are seeing the same thing when switching librbd from 0.80.7 to
 0.80.8 - there is a massive performance regression in random reads.   In our
 case, from ~10,000 4k read iops down to less than 1,000.

 We also tested librbd 0.87.1 , and found it does not have this problem - it
 appears to be isolated to 0.80.8 only.

I'm not familiar with the details of the issue, but we're putting out
0.80.9 as soon as we can, which should resolve this. There was an
incomplete backport or something that is causing the slowness.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problem in cephfs for remove empty directory

2015-03-03 Thread Gregory Farnum
On Tue, Mar 3, 2015 at 9:24 AM, John Spray john.sp...@redhat.com wrote:
 On 03/03/2015 14:07, Daniel Takatori Ohara wrote:

 $ls test-daniel-old/
 total 0
 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar  2 10:52 ./
 drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar  2 11:41 ../

 $rm -rf test-daniel-old/
 rm: cannot remove ‘test-daniel-old/’: Directory not empty

 $ls test-daniel-old/
 ls: cannot access
 test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file
 or directory
 ls: cannot access
 test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such
 file or directory
 ls: cannot access
 test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file
 or directory
 ls: cannot access
 test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such
 file or directory
 ls: cannot access
 test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file
 or directory
 ls: cannot access
 test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such
 file or directory
 ls: cannot access
 test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file
 or directory
 ls: cannot access
 test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such
 file or directory
 total 0
 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar  2 10:52 ./
 drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar  2 11:41 ../
 l? ? ?  ?   ??
 M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam
 l? ? ?  ?   ??
 M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam

 You don't say what version of the client (version of kernel, if it's the
 kernel client) this is.  It would appear that the client thinks there are
 some dentries that don't really exist.  You should enable verbose debug logs
 (with fuse client, debug client = 20) and reproduce this.  It looks like
 you had similar issues (subject: problem for remove files in cephfs) a
 while back, when Yan Zheng also advised you to get some debug logs.

In particular this is a known bug in older kernels and is fixed in new
enough ones. Unfortunately I don't have the bug link handy though. :(
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shutting down a cluster fully and powering it back up

2015-02-28 Thread Gregory Farnum
Sounds good!
-Greg
On Sat, Feb 28, 2015 at 10:55 AM David da...@visions.se wrote:

 Hi!

 I’m about to do maintenance on a Ceph Cluster, where we need to shut it
 all down fully.
 We’re currently only using it for rados block devices to KVM Hypervizors.

 Are these steps sane?

 Shutting it down

 1. Shut down all IO to the cluster. Means turning off all clients (KVM
 Hypervizors in our case).
 2. Set cluster to noout by running: ceph osd set noout
 3. Shut down the MON nodes.
 4. Shut down the OSD nodes.

 Starting it up

 1. Start the OSD nodes.
 2. Start the MON nodes.
 3. Check ceph -w to see the status of ceph and take actions if something
 is wrong.
 4. Start up the clients (KVM Hypervizors)
 5. Run ceph osd unset noout

 Kind Regards,
 David
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] old osds take much longer to start than newer osd

2015-03-02 Thread Gregory Farnum
This is probably LevelDB being slow. The monitor has some options to
compact the store on startup and I thought the osd handled it
automatically, but you could try looking for something like that and see if
it helps.
-Greg
On Fri, Feb 27, 2015 at 5:02 AM Corin Langosch corin.lango...@netskin.com
wrote:

 Hi guys,

 I've been using ceph for a long time now, since bobtail. I always upgraded every
 few weeks/ months to the latest stable
 release. Of course I also removed some osds and added new ones. Now during
 the last few upgrades (I just upgraded from
 0.80.6 to 0.80.8) I noticed that old osds take much longer to start up than
 equivalent newer osds (same amount of data/ disk
 usage, same kind of storage+journal backing device (ssd), same weight,
 same number of pgs, ...). I know I observed the
 same behavior earlier but just didn't really care about it. Here are the
 relevant log entries (host of osd.0 and osd.15
 has less cpu power than the others):

 old osds (average pgs load time: 1.5 minutes)

 2015-02-27 13:44:23.134086 7ffbfdcbe780  0 osd.0 19323 load_pgs
 2015-02-27 13:49:21.453186 7ffbfdcbe780  0 osd.0 19323 load_pgs opened 824
 pgs

 2015-02-27 13:41:32.219503 7f197b0dd780  0 osd.3 19317 load_pgs
 2015-02-27 13:42:56.310874 7f197b0dd780  0 osd.3 19317 load_pgs opened 776
 pgs

 2015-02-27 13:38:43.909464 7f450ac90780  0 osd.6 19309 load_pgs
 2015-02-27 13:40:40.080390 7f450ac90780  0 osd.6 19309 load_pgs opened 806
 pgs

 2015-02-27 13:36:14.451275 7f3c41d33780  0 osd.9 19301 load_pgs
 2015-02-27 13:37:22.446285 7f3c41d33780  0 osd.9 19301 load_pgs opened 795
 pgs

 new osds (average pgs load time: 3 seconds)

 2015-02-27 13:44:25.529743 7f2004617780  0 osd.15 19325 load_pgs
 2015-02-27 13:44:36.197221 7f2004617780  0 osd.15 19325 load_pgs opened
 873 pgs

 2015-02-27 13:41:29.176647 7fb147fb3780  0 osd.16 19315 load_pgs
 2015-02-27 13:41:31.681722 7fb147fb3780  0 osd.16 19315 load_pgs opened
 848 pgs

 2015-02-27 13:38:41.470761 7f9c404be780  0 osd.17 19307 load_pgs
 2015-02-27 13:38:43.737473 7f9c404be780  0 osd.17 19307 load_pgs opened
 821 pgs

 2015-02-27 13:36:10.997766 7f7315e99780  0 osd.18 19299 load_pgs
 2015-02-27 13:36:13.511898 7f7315e99780  0 osd.18 19299 load_pgs opened
 815 pgs

 The old osds also take more memory, here's an example:

 root 15700 22.8  0.7 1423816 485552 ?  Ssl  13:36   4:55
 /usr/bin/ceph-osd -i 9 --pid-file
 /var/run/ceph/osd.9.pid -c /etc/ceph/ceph.conf --cluster ceph
 root 15270 15.4  0.4 1227140 297032 ?  Ssl  13:36   3:20
 /usr/bin/ceph-osd -i 18 --pid-file
 /var/run/ceph/osd.18.pid -c /etc/ceph/ceph.conf --cluster ceph


 It seems to me there is still some old data around for the old osds which
 was not properly migrated/ cleaned up during
 the upgrades. The cluster is healthy, no problems at all the last few
 weeks. Is there any way to clean this up?

 Thanks
 Corin
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-02 Thread Gregory Farnum
You can turn the filestore up to 20 instead of 1. ;) You might also
explore what information you can get out of the admin socket.

You are correct that those numbers are the OSD epochs, although note
that when the system is running you'll get output both for the OSD as
a whole and for individual PGs within it (which can be lagging
behind). I'm still pretty convinced the OSDs are simply stuck trying
to bring their PGs up to date and are thrashing the maps on disk, but
we're well past what I can personally diagnose without log diving.
-Greg
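
On the admin-socket side, a couple of commands worth trying against the stuck OSDs (osd.1's default socket path shown; "status" reports the oldest/newest map epochs the OSD has, which is exactly the number being watched above, and a wedged daemon may not answer):

    ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok status
    ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
    # raise logging at runtime instead of editing ceph.conf and restarting:
    ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set debug_filestore 20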

On Sat, Feb 28, 2015 at 11:51 AM, Chris Murray chrismurra...@gmail.com wrote:
 After noticing that the number increases by 101 on each attempt to start
 osd.11, I figured I was only 7 iterations away from the output being
 within 101 of 63675. So, I killed the osd process, started it again,
 lather, rinse, repeat. I then did the same for other OSDs. Some created
 very small logs, and some created logs into the gigabytes. Grepping the
 latter for update_osd_stat showed me where the maps were up to, and
 therefore which OSDs needed some special attention. Some of the epoch
 numbers appeared to increase by themselves to a point and then plateaux,
 after which I'd kill then start the osd again, and this number would
 start to increase again.

 After all either showed 63675, or nothing at all, I turned debugging
 back off, deleted logs, and tried to bring the cluster back by unsetting
 noup, nobackfill, norecovery etc. It hasn't got very far before
 appearing stuck again, with nothing progressing in ceph status. It
 appears that 11/15 OSDs are now properly up, but four still aren't. A
 lot of placement groups are stale, so I guess I really need the
 remaining four to come up.

 The OSDs in question are 1, 7, 10  12. All have a line similar to this
 as the last in their log:

 2015-02-28 10:35:04.240822 7f375ef40780  1 journal _open
 /var/lib/ceph/osd/ceph-1/journal fd 21: 5367660544 bytes, block size
 4096 bytes, directio = 1, aio = 1

 Even with the following in ceph.conf, I'm not seeing anything after that
 last line in the log.

  debug osd = 20
  debug filestore = 1

 CPU is still being consumed by the ceph-osd process though, but not much
 memory is being used compared to the other two OSDs which are up on that
 node.

 Is there perhaps even further logging that I can use to see why the logs
 aren't progressing past this point?
 Osd.1 is on /dev/sdb. iostat still shows some activity as the minutes go
 on, but not much:

 (60 second intervals)
 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
 sdb   5.45 0.00   807.33  0  48440
 sdb   5.75 0.00   807.33  0  48440
 sdb   5.43 0.00   807.20  0  48440

 Thanks,
 Chris

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Chris Murray
 Sent: 27 February 2015 10:32
 To: Gregory Farnum
 Cc: ceph-users
 Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will
 the cluster recover without help?

 A little further logging:

 2015-02-27 10:27:15.745585 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:15.745619 7fe8e3f2f700  5 osd.11 62839 heartbeat:
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:23.530913 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f26380 con 0xe1f0cc60
 2015-02-27 10:27:30.645902 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:30.645938 7fe8e3f2f700  5 osd.11 62839 heartbeat:
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:33.531142 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f26540 con 0xe1f0cc60
 2015-02-27 10:27:43.531333 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f26700 con 0xe1f0cc60
 2015-02-27 10:27:45.546275 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:45.546311 7fe8e3f2f700  5 osd.11 62839 heartbeat:
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:53.531564 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f268c0 con 0xe1f0cc60
 2015-02-27 10:27:56.846593 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:56.846627 7fe8e3f2f700  5 osd.11 62839

Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread Gregory Farnum
On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote:
 Hi all, especially devs,

 We have recently pinpointed one of the causes of slow requests in our
 cluster. It seems deep-scrubs on pg's that contain the index file for a
 large radosgw bucket lock the osds. Incresing op threads and/or disk threads
 helps a little bit, but we need to increase them beyond reason in order to
 completely get rid of the problem. A somewhat similar (and more severe)
 version of the issue occurs when we call listomapkeys for the index file,
 and since the logs for deep-scrubbing were much harder to read, this inspection
 was based on listomapkeys.

 In this example osd.121 is the primary of pg 10.c91 which contains file
 .dir.5926.3 in .rgw.buckets pool. OSD has 2 op threads. Bucket contains
 ~500k objects. Standard listomapkeys call take about 3 seconds.

 time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
 real 0m2.983s
 user 0m0.760s
 sys 0m0.148s

 In order to lock the osd we request 2 of them simultaneously with something
 like:

 rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
 sleep 1
 rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &

 'debug_osd=30' logs show the flow like:

 At t0 some thread enqueue_op's my omap-get-keys request.
 Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading ~500k
 keys.
 Op-Thread B responds to several other requests during that 1 second sleep.
 They're generally extremely fast subops on other pgs.
 At t1 (about a second later) my second omap-get-keys request gets
 enqueue_op'ed. But it does not start probably because of the lock held by
 Thread A.
 After that point other threads enqueue_op other requests on other pgs too
 but none of them starts processing, in which i consider the osd is locked.
 At t2 (about another second later) my first omap-get-keys request is
 finished.
 Op-Thread B locks pg 10.c91 and dequeue_op's my second request and starts
 reading ~500k keys again.
 Op-Thread A continues to process the requests enqueued in t1-t2.

 It seems Op-Thread B is waiting on the lock held by Op-Thread A while it can
 process other requests for other pg's just fine.

 My guess is a somewhat larger scenario happens in deep-scrubbing, like on
 the pg containing index for the bucket of 20M objects. A disk/op thread
 starts reading through the omap which will take say 60 seconds. During the
 first seconds, other requests for other pgs pass just fine. But in 60
 seconds there are bound to be other requests for the same pg, especially
 since it holds the index file. Each of these requests lock another disk/op
 thread to the point where there are no free threads left to process any
 requests for any pg. Causing slow-requests.

 So first of all thanks if you can make it here, and sorry for the involved
 mail, i'm exploring the problem as i go.
 Now, is that deep-scrubbing situation i tried to theorize even possible? If
 not can you point us where to look further.
 We are currently running 0.72.2 and know about newer ioprio settings in
 Firefly and such. While we are planning to upgrade in a few weeks but i
 don't think those options will help us in any way. Am i correct?
 Are there any other improvements that we are not aware of?

This is all basically correct; it's one of the reasons you don't want
to let individual buckets get too large.

That said, I'm a little confused about why you're running listomapkeys
that way. RGW throttles itself by getting only a certain number of
entries at a time (1000?) and any system you're also building should
do the same. That would reduce the frequency of any issues, and I
*think* that scrubbing has some mitigating factors to help (although
maybe not; it's been a while since I looked at any of that stuff).

Although I just realized that my vague memory of deep scrubbing
working better might be based on improvements that only got in for
firefly...not sure.
-Greg
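
For reference, the ioprio and scrub knobs mentioned above would look roughly
like this in ceph.conf (a sketch only; the option names and their availability
depend on the exact Firefly/Giant point release, and the ioprio class only
takes effect with the CFQ disk scheduler, so verify against your version):

[osd]
# run the disk/scrub thread at idle I/O priority
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
# illustrative values for spreading scrubs out
osd scrub max interval = 604800
osd deep scrub interval = 2419200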
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Log Rotation (firefly)

2015-03-02 Thread Gregory Farnum
On Mon, Mar 2, 2015 at 8:44 AM, Daniel Schneller
daniel.schnel...@centerdevice.com wrote:
 On our Ubuntu 14.04/Firefly 0.80.8 cluster we are seeing
 problem with log file rotation for the rados gateway.

 The /etc/logrotate.d/radosgw script gets called, but
 it does not work correctly. It spits out this message,
 coming from the postrotate portion:

/etc/cron.daily/logrotate:
reload: Unknown parameter: id
invoke-rc.d: initscript radosgw, action reload failed.

 A new log file actually gets created, but due to the
 failure in the post-rotate script, the daemon actually
 continues writing into the now deleted previous file:

[B|root@node01]  /etc/init ➜  ps aux | grep radosgw
root 13077  0.9  0.1 13710396 203256 ? Ssl  Feb14 212:27
 /usr/bin/radosgw -n client.radosgw.node01

[B|root@node01]  /etc/init ➜  ls -l /proc/13077/fd/
total 0
lr-x------ 1 root root 64 Mar  2 15:53 0 -> /dev/null
lr-x------ 1 root root 64 Mar  2 15:53 1 -> /dev/null
lr-x------ 1 root root 64 Mar  2 15:53 2 -> /dev/null
l-wx------ 1 root root 64 Mar  2 15:53 3 -> /var/log/radosgw/radosgw.log.1 (deleted)
...

 Trying manually with "service radosgw reload" fails with
 the same message. Running the non-upstart
 "/etc/init.d/radosgw reload" works. It will, kind of crudely,
 just send a SIGHUP to any running radosgw process.

 To figure out the cause I compared OSDs and RadosGW with respect
 to upstart and got this:

[B|root@node01]  /etc/init ➜  initctl list | grep osd
ceph-osd-all start/running
ceph-osd-all-starter stop/waiting
ceph-osd (ceph/8) start/running, process 12473
ceph-osd (ceph/9) start/running, process 12503
...

[B|root@node01]  /etc/init ➜  initctl reload radosgw cluster=ceph
 id=radosgw.node01
initctl: Unknown instance: ceph/radosgw.node01

[B|root@node01]  /etc/init ➜  initctl list | grep rados
radosgw-instance stop/waiting
radosgw stop/waiting
radosgw-all-starter stop/waiting
radosgw-all start/running

 Apart from me not being totally clear about what the difference
 between radosgw-instance and radosgw is, obviously Upstart
 has no idea about which PID to send the SIGHUP to when I ask
 it to reload.

 I can, of course, replace the logrotate config and use the
 /etc/init.d/radosgw reload  approach, but I would like to
 understand if this is something unique to our system, or if
 this is a bug in the scripts.
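
 For what it's worth, a minimal sketch of that fallback in
 /etc/logrotate.d/radosgw might look like this (paths and rotation policy here
 are assumptions, not taken from the packaged script):

 /var/log/radosgw/*.log {
     daily
     rotate 7
     compress
     sharedscripts
     postrotate
         # crude but effective: the SysV script just SIGHUPs any running radosgw
         /etc/init.d/radosgw reload > /dev/null 2>&1 || true
     endscript
 }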

 FWIW here's an excerpt from /etc/ceph.conf:

[client.radosgw.node01]
host = node01
rgw print continue = false
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/radosgw/radosgw.log
rgw enable ops log = false
rgw gc max objs = 31

I'm not very (well, at all, for rgw) familiar with these scripts, but
how are you starting up your RGW daemon? There's some way to have
Apache handle the process instead of Upstart, but Yehuda says you
don't want to do it.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What does the parameter journal_align_min_size mean?

2015-03-02 Thread Gregory Farnum
On Fri, Feb 27, 2015 at 5:03 AM, Mark Wu wud...@gmail.com wrote:

 I am wondering how the value of journal_align_min_size affects
 journal padding. Is there any document describing the disk layout of the
 journal?

Not much, unfortunately. Just looking at the code, the journal will
align any writes which are at least as large as that parameter,
apparently based on the page size and the target offset within the
destination object. I think this is so that it's more conveniently
aligned for transfer into the filesystem later on, whereas smaller
writes can just get copied?
-Greg
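
For anyone who wants to experiment with it, the option lives in the [osd]
section; a sketch (the 64 KB value is believed to be the default, so check it
with "ceph daemon osd.N config get journal_align_min_size" before relying on
it):

[osd]
# writes at least this large are padded/aligned in the journal
journal align min size = 65536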
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster Address

2015-03-04 Thread Gregory Farnum
On Tue, Mar 3, 2015 at 9:26 AM, Garg, Pankaj
pankaj.g...@caviumnetworks.com wrote:
 Hi,

 I have a ceph cluster that is contained within a rack (1 Monitor and 5 OSD
 nodes). I kept the same public and private addresses in the configuration.

 I do have 2 NICS and 2 valid IP addresses (one internal only and one
 external) for each machine.



 Is it possible now, to change the Public Network address, after the cluster
 is up and running?

 I had used Ceph-deploy for the cluster. If I change the address of the
 public network in Ceph.conf, do I need to propagate to all the machines in
 the cluster or just the Monitor Node is enough?

You'll need to change the config on each node and then restart it so
that the OSDs will bind to the new location. The OSDs will let you do
this on a rolling basis, but the networks will need to be routable to
each other.
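
A rough sketch of the rolling change (the subnet is only an example; adapt the
restart command to your init system):

# in /etc/ceph/ceph.conf on every node
[global]
public network = 10.20.0.0/24

# then restart OSDs one host at a time so the cluster stays available
service ceph restart osd.0        # sysvinit
restart ceph-osd id=0             # Ubuntu/upstart equivalent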

Note that changing the addresses on the monitors (I can't tell if you
want to do that) is much more difficult; it's probably easiest to
remove one at a time from the cluster and then recreate it with its
new IP. (There are docs on how to do this.)
-Greg




 Thanks

 Pankaj


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?

2015-03-04 Thread Gregory Farnum
Just to get more specific: the reason you can apparently write stuff
to a file when you can't write to the pool it's stored in is because
the file data is initially stored in cache. The flush out to RADOS,
when it happens, will fail.

It would definitely be preferable if there was some way to immediately
return a permission or IO error in this case, but so far we haven't
found one; the relevant interfaces just aren't present and it's
unclear how to propagate the data back to users in a way that makes
sense even if they were. :/
-Greg

On Wed, Mar 4, 2015 at 3:37 AM, SCHAER Frederic frederic.sch...@cea.fr wrote:
 Hi,

 Many thanks for the explanations.
 I haven't used the nodcache option when mounting cephfs, it actually got 
 there by default

 My mount command is/was :
 # mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret

 I don't know what causes this option to be default, maybe it's the kernel 
 module I compiled from git (because there is no kmod-ceph or kmod-rbd in any 
 RHEL-like distributions except RHEV), I'll try to update/check ...

 Concerning the rados ls on the pool, indeed: I created empty files in the pool, and
 they were not showing up, probably because they were just empty - but when I
 create a non-empty file, I see things in rados ls...

 Thanks again
 Frederic


 -----Original Message-----
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John
 Spray
 Sent: Tuesday, March 3, 2015 17:15
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?



 On 03/03/2015 15:21, SCHAER Frederic wrote:

 By the way : looks like the ceph fs ls command is inconsistent when
 the cephfs is mounted (I used a locally compiled kmod-ceph rpm):

 [root@ceph0 ~]# ceph fs ls

 name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]

 (umount /mnt .)

 [root@ceph0 ~]# ceph fs ls

 name: cephfs_puppet, metadata pool: puppet_metadata, data pools:
 [puppet root ]

 This is probably #10288, which was fixed in 0.87.1

 So, I have this pool named root that I added in the cephfs filesystem.

 I then edited the filesystem xattrs :

 [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root

 getfattr: Removing leading '/' from absolute path names

 # file: mnt/root

 ceph.dir.layout=stripe_unit=4194304 stripe_count=1
 object_size=4194304 pool=root

 I'm therefore assuming client.puppet should not be allowed to write or
 read anything in /mnt/root, which belongs to the root pool, but that
 is not the case.

 On another machine where I mounted cephfs using the client.puppet key,
 I can do this :

 The mount was done with the client.puppet key, not the admin one that
 is not deployed on that node :

 1.2.3.4:6789:/ on /mnt type ceph
 (rw,relatime,name=puppet,secret=hidden,nodcache)

 [root@dev7248 ~]# echo not allowed > /mnt/root/secret.notfailed

 [root@dev7248 ~]#

 [root@dev7248 ~]# cat /mnt/root/secret.notfailed

 not allowed

 This is data you're seeing from the page cache, it hasn't been written
 to RADOS.

 You have used the nodcache setting, but that doesn't mean what you
 think it does (it was about caching dentries, not data).  It's actually
 not even used in recent kernels (http://tracker.ceph.com/issues/11009).

 You could try the nofsc option, but I don't know exactly how much
 caching that turns off -- the safer approach here is probably to do your
 testing using I/Os that have O_DIRECT set.
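
 For example, a quick sketch of such a test from the client (path and sizes are
 arbitrary; whether a clean error comes back may still depend on the client
 version, but at least the page cache won't mask the result):

 # write directly, bypassing the client page cache
 dd if=/dev/zero of=/mnt/root/directtest bs=4M count=4 oflag=direct
 # read it back, again bypassing the cache
 dd if=/mnt/root/directtest of=/dev/null bs=4M iflag=direct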

 And I can even see the xattrs inherited from the parent dir :

 [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed

 getfattr: Removing leading '/' from absolute path names

 # file: mnt/root/secret.notfailed

 ceph.file.layout=stripe_unit=4194304 stripe_count=1
 object_size=4194304 pool=root

 Whereas on the node where I mounted cephfs as ceph admin, I get nothing :

 [root@ceph0 ~]# cat /mnt/root/secret.notfailed

 [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed

 -rw-r--r-- 1 root root 12 Mar  3 15:27 /mnt/root/secret.notfailed

 After some time, the file also gets empty on the puppet client host :

 [root@dev7248 ~]# cat /mnt/root/secret.notfailed

 [root@dev7248 ~]#

 (but the metadata remained ?)

 Right -- eventually the cache goes away, and you see the true (empty)
 state of the file.

 Also, as an unprivileged user, I can get ownership of a secret file
 by changing the extended attribute:

 [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet
 /mnt/root/secret.notfailed

 [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed

 getfattr: Removing leading '/' from absolute path names

 # file: mnt/root/secret.notfailed

 ceph.file.layout=stripe_unit=4194304 stripe_count=1
 object_size=4194304 pool=puppet

 Well, you're not really getting ownership of anything here: you're
 modifying the file's metadata, which you are entitled to do (pool
 permissions have nothing to do with file metadata).  There was a recent
 bug where a file's pool layout could 

Re: [ceph-users] Does Ceph rebalance OSDs proportionally

2015-02-25 Thread Gregory Farnum
Yes. :)
-Greg
On Wed, Feb 25, 2015 at 8:33 AM Jordan A Eliseo jaeli...@us.ibm.com wrote:

 Hi all,

 Quick question, does the Crush map always strive for proportionality when
 rebalancing a cluster? i.e. Say I have 8 OSDs (with a two node cluster - 4
 OSDs per host - at ~90% utilization (which I know is bad, this is just
 hypothetical). Now if I add a total of 8 OSDs - 4 new OSDs for each host -
 will the crush map try to rebalance such that all disks have a utilization
 of 40-50%? Assumption being all disks are of equal size and weight.

 Regards,

  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange 'ceph df' output

2015-02-25 Thread Gregory Farnum
IIRC these global values for total size and available are just summations
from the (programmatic equivalent) of running df on each machine locally,
but the used values are based on actual space used by each PG. That has
occasionally produced some odd results depending on how you've configured
your system and how that translates into df output. (Eg you might be using
up space for journals or your OS that aren't considered as used for the
purposes of RADOS' df.)
-Greg
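
A rough way to see that difference on your own nodes (a sketch):

# what the monitors aggregate
ceph df detail
# what each OSD's filesystem reports locally
df -h /var/lib/ceph/osd/ceph-*

Journals, the OS, or anything else sharing those filesystems can account for
the gap between the two views.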
On Wed, Feb 25, 2015 at 6:57 AM Kamil Kuramshin kamil.kurams...@tatar.ru
wrote:

  Can't find out why this can happen:
 Got a HEALTH_OK cluster. ceph version 0.87, all nodes are Debian Wheezy
 with a stable kernel 3.2.65-1+deb7u1. ceph df shows me this:

 $ ceph df
 GLOBAL:
 SIZE AVAIL RAW USED %RAW USED
 242T  221T  8519G  3.43
 POOLS:
 NAME  ID USED  %USED MAX AVAIL OBJECTS
 rbd   2  1948G  0.79  74902G  498856
 ec_backup-storage 4  0 0  146T   0
 cache 5  0 0  184G   0
 block-devices 6   827G  0.33  74902G  211744

 Explanation:

 Total space = Used space + Available space:
 242T ≠ 8.5T + 221T, but they MUST be equal, must they not? Where have I lost
 approximately 12.5 TB of space?

 $ ceph -s
 cluster 0745bec9-a7a7-4ee1-be5d-bb12db3cdd8f
  health HEALTH_OK
  monmap e1: 3 mons at {node04=
 10.0.0.14:6789/0,node05=10.0.0.15:6789/0,node06=10.0.0.16:6789/0},
 election epoch 48, quorum 0,1,2 node04,node05,node06
  osdmap e16866: 102 osds: 102 up, 102 in
   pgmap v570489: 10200 pgs, 4 pools, 2775 GB data, 693 kobjects
 8518 GB used, 221 TB / 242 TB avail
10200 active+clean

  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Wrong object and used space count in cache tier pool

2015-02-24 Thread Gregory Farnum
On Tue, Feb 24, 2015 at 6:21 AM, Xavier Villaneau
xavier.villan...@fr.clara.net wrote:
 Hello ceph-users,

 I am currently making tests on a small cluster, and Cache Tiering is one of
 those tests. The cluster runs Ceph 0.87 Giant on three Ubuntu 14.04 servers
 with the 3.16.0 kernel, for a total of 8 OSD and 1 MON.

 Since there are no SSDs in those servers, I am testing Cache Tiering by
 using an erasure-coded pool as storage and a replicated pool as cache. The
 cache settings are the defaults ones you'll find in the documentation, and
 I'm using writeback mode. Also, to simulate the small size of cache data,
 the hot storage pool has a 1024MB space quota. Then I write 4MB chunks of
 data to the storage pool using 'rados bench' (with --no-cleanup).

 Here are my cache pool settings according to InkScope :
 pool    15
 pool name   test1_ct-cache
 auid    0
 type    1 (replicated)
 size    2
 min size    1
 crush ruleset   0 (replicated_ruleset)
 pg num  512
 pg placement_num    512
 quota max_bytes 1 GB
 quota max_objects   0
 flags names hashpspool,incomplete_clones
 tiers   none
 tier of 14 (test1_ec-data)
 read tier   -1
 write tier  -1
 cache mode  writeback
 cache target_dirty_ratio_micro  40 %
 cache target_full_ratio_micro   80 %
 cache min_flush_age 0 s
 cache min_evict_age 0 s
 target max_objects  0
 target max_bytes960 MB
 hit set_count   1
 hit set_period  3600 s
 hit set_params  target_size :0
 seed :   0
 type :   bloom
 false_positive_probability : 0.05

 I believe the tiering itself works well, I do see objects and bytes being
 transfered from the cache to the storage when I write data. I checked with
 'rados ls', and the object count in the cold storage is always right on
 spot. But it isn't in the cache, when I do 'ceph df' or 'rados df' the space
 and object counts do not match with 'rados ls', and are usually much larger
 :

 % ceph df
 …
 POOLS:
 NAME   ID USED   %USED MAX AVAIL OBJECTS
 …
 test1_ec-data  14  5576M  0.045G 1394
 test1_ct-cache 15   772M 0 7410G 250
 % rados -p test1_ec-data ls | wc -l
 1394
 % rados -p test1_ct-cache ls | wc -l
 56
 # And this corresponds to 220M of data in test1_ct-cache

 Not only it prevents me from knowing exactly what the cache is doing, but it
 is also this value that is applied for the quota. And I've seen writing
 operations fail because the space count had reached 1G, although I was quite
 sure there was enough space. The count does not correct itself over time,
 even by waiting overnight. The count only changes when I poke the pool by
 changing a setting or writing data, but remains wrong (and not by the same
 number of objects). The changes in object counts given by 'rados ls' in both
 pools match with the number of objects written by 'rados bench'.

 Does anybody know where this mismatch might come from ? Is there a way to
 see more details about what's going on ? Or is it the normal behavior of a
 cache pool when 'rados bench' is used ?

Well, I don't think the quota stuff is going to interact well with
caching pools; the size limits are implemented at different places in
the cache.

Similarly, rados ls definitely doesn't work properly on cache pools;
you shouldn't expect anything sensible to come out of it. Among other
things, there are whiteout objects in the cache pool (recording that
an object is known not to exist in the base pool) that won't be listed
in rados ls, and I'm sure there's other stuff too.

If you're trying to limit the cache pool size you want to do that with
the target size and dirty targets/limits.
-Greg
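
The knobs Greg refers to are set per pool; a sketch with example values only:

ceph osd pool set test1_ct-cache target_max_bytes 1000000000
ceph osd pool set test1_ct-cache target_max_objects 100000
ceph osd pool set test1_ct-cache cache_target_dirty_ratio 0.4
ceph osd pool set test1_ct-cache cache_target_full_ratio 0.8

Flushing and eviction are driven off these targets rather than the pool quota.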
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS [WRN] getattr pAsLsXsFs failed to rdlock

2015-02-26 Thread Gregory Farnum
For everybody else's reference, this is addressed in
http://tracker.ceph.com/issues/10944. That kernel has several known
bugs.
-Greg

On Tue, Feb 24, 2015 at 12:02 PM, Ilja Slepnev islep...@gmail.com wrote:
 Dear All,

 Configuration of MDS and CephFS client is the same:
 OS: CentOS 7.0.1406
 ceph-0.87
 Linux 3.10.0-123.20.1.el7.centos.plus.x86_64
 dmesg: libceph: loaded (mon/osd proto 15/24)
 dmesg: ceph: loaded (mds proto 32)
 Using kernel ceph module, fstab mount options:
 defaults,_netdev,ro,noatime,name=admin,secret=hidden
 CephFS mount is exported by NFS.

 Problem:
 After period of light activity (reading files, listing dirs) one of the
 cephfs paths got stuck in directory listing process, on local machine and
 via NFS.

 Log messages on MDS (repeating):
 2015-02-24 16:02:41.564071 7fdb0055c700  0 log_channel(default) log [WRN] :
 9 slow requests, 1 included below; oldest blocked for > 14463.448519 secs
 2015-02-24 16:02:41.564077 7fdb0055c700  0 log_channel(default) log [WRN] :
 slow request 1922.318256 seconds old, received at 2015-02-24
 15:30:39.245786: client_request(client.66401597:2440 getattr pAsLsXsFs
 #1002d68) currently failed to rdlock, waiting

 Could it be a broken metadata, or a bug? How to find out what is going
 wrong?

 Is there a workaround?

 WBR,
 Ilja Slepnev

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Minor version difference between monitors and OSDs

2015-02-20 Thread Gregory Farnum
On Thu, Feb 19, 2015 at 8:30 PM, Christian Balzer ch...@gol.com wrote:

 Hello,

 I have a cluster currently at 0.80.1 and would like to upgrade it to
 0.80.7 (Debian as you can guess), but for a number of reasons I can't
 really do it all at the same time.

 In particular I would like to upgrade the primary monitor node first and
 the secondary ones as well as the OSDs later.

 Now my understanding and hope is that unless I change the config to add
 features that aren't present in 0.80.1, things should work just fine,
 especially given the main release note blurb about 0.80.7:

I don't think we test upgrades between that particular combination of
versions, but as a matter of policy there shouldn't be any issues
between point releases.

The release note is referring to the issue described at
http://tracker.ceph.com/issues/9419, which is indeed for pre-Firefly
to Firefly upgrades. :)
-Greg
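
One sanity check during a rolling upgrade like this is to confirm which version
each daemon is actually running after each node comes back (a sketch; adjust
the mon id to match your monitor's name):

# ask every OSD over the network
ceph tell osd.* version
# check a monitor locally via its admin socket
ceph daemon mon.$(hostname -s) version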
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD not marked as down or out

2015-02-20 Thread Gregory Farnum
That's pretty strange, especially since the monitor is getting the
failure reports. What version are you running? Can you bump up the
monitor debugging and provide its output from around that time?
-Greg
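
If it helps, the monitor's log level can be bumped at runtime (a sketch, using
mon.storage1 from the log excerpt below; remember to turn it back down
afterwards):

# on the monitor host, via its admin socket
ceph daemon mon.storage1 config set debug_mon 10
ceph daemon mon.storage1 config set debug_ms 1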

On Fri, Feb 20, 2015 at 3:26 AM, Sudarshan Pathak sushan@gmail.com wrote:
 Hello everyone,

 I have a cluster running with OpenStack. It has 6 OSDs (3 in each of 2 different
 locations). Each pool has a replication size of 3, with 2 copies in the primary
 location and 1 copy at the secondary location.

 Everything is running as expected, but the OSDs are not marked as down when I
 power off an OSD server. It has been around an hour.
 I tried changing the heartbeat settings too.

 Can someone point me in the right direction?

 OSD 0 log
 =
 2015-02-20 16:20:14.009723 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
 reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
 16:15:54.607854 (cutoff 2015-02-20 16:19:54.009720)
 2015-02-20 16:20:15.009908 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
 reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
 16:15:54.607854 (cutoff 2015-02-20 16:19:55.009907)
 2015-02-20 16:20:16.010123 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
 reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
 16:15:54.607854 (cutoff 2015-02-20 16:19:56.010119)
 2015-02-20 16:20:16.648167 7f3fc9a76700 -1 osd.0 451 heartbeat_check: no
 reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
 16:15:54.607854 (cutoff 2015-02-20 16:19:56.648165)


 Ceph monitor log
 
 2015-02-20 16:49:16.831548 7f416e4aa700  1 mon.storage1@1(leader).osd e455
 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.4
 192.168.100.35:6800/1305 is reporting failure:1
 2015-02-20 16:49:16.831593 7f416e4aa700  0 log_channel(cluster) log [DBG] :
 osd.2 192.168.100.33:6800/24431 reported failed by osd.4
 192.168.100.35:6800/1305
 2015-02-20 16:49:17.080314 7f416e4aa700  1 mon.storage1@1(leader).osd e455
 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.3
 192.168.100.34:6800/1358 is reporting failure:1
 2015-02-20 16:49:17.080527 7f416e4aa700  0 log_channel(cluster) log [DBG] :
 osd.2 192.168.100.33:6800/24431 reported failed by osd.3
 192.168.100.34:6800/1358
 2015-02-20 16:49:17.420859 7f416e4aa700  1 mon.storage1@1(leader).osd e455
 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.5
 192.168.100.36:6800/1359 is reporting failure:1


 #ceph osd stat
  osdmap e455: 6 osds: 6 up, 6 in


 #ceph -s
 cluster c8a5975f-4c86-4cfe-a91b-fac9f3126afc
  health HEALTH_WARN 528 pgs peering; 528 pgs stuck inactive; 528 pgs
 stuck unclean; 1 requests are blocked > 32 sec; 1 mons down, quorum 1,2,3,4
 storage1,storage2,compute3,compute4
  monmap e1: 5 mons at
 {admin=192.168.100.39:6789/0,compute3=192.168.100.133:6789/0,compute4=192.168.100.134:6789/0,storage1=192.168.100.120:6789/0,storage2=192.168.100.121:6789/0},
 election epoch 132, quorum 1,2,3,4 storage1,storage2,compute3,compute4
  osdmap e455: 6 osds: 6 up, 6 in
   pgmap v48474: 3650 pgs, 19 pools, 27324 MB data, 4420 objects
 82443 MB used, 2682 GB / 2763 GB avail
 3122 active+clean
  528 remapped+peering



 Ceph.conf file

 [global]
 fsid = c8a5975f-4c86-4cfe-a91b-fac9f3126afc
 mon_initial_members = admin, storage1, storage2, compute3, compute4
 mon_host =
 192.168.100.39,192.168.100.120,192.168.100.121,192.168.100.133,192.168.100.134
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true

 osd pool default size = 3
 osd pool default min size = 3

 osd pool default pg num = 300
 osd pool default pgp num = 300

 public network = 192.168.100.0/24

 rgw print continue = false
 rgw enable ops log = false

 mon osd report timeout = 60
 mon osd down out interval = 30
 mon osd min down reports = 2

 osd heartbeat grace = 10
 osd mon heartbeat interval = 20
 osd mon report interval max = 60
 osd mon ack timeout = 15

 mon osd min down reports = 2


 Regards,
 Sudarshan Pathak

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power failure recovery woes (fwd)

2015-02-20 Thread Gregory Farnum
You can try searching the archives and tracker.ceph.com for hints
about repairing these issues, but your disk stores have definitely
been corrupted and it's likely to be an adventure. I'd recommend
examining your local storage stack underneath Ceph and figuring out
which part was ignoring barriers.
-Greg
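
A few quick places to look when hunting for ignored barriers (a sketch, not
exhaustive):

# any filesystems mounted with nobarrier?
grep -i barrier /proc/mounts
# is the on-disk volatile write cache enabled on the data/journal disks?
hdparm -W /dev/sda
# for RAID controllers, also check whether write-back cache is enabled
# without a working battery/flash backup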

On Fri, Feb 20, 2015 at 10:39 AM, Jeff j...@usedmoviefinder.com wrote:
 Should I infer from the silence that there is no way to recover from the

  FAILED assert(last_e.version.version < e.version.version) errors?

 Thanks,
 Jeff

 - Forwarded message from Jeff j...@usedmoviefinder.com -

 Date: Tue, 17 Feb 2015 09:16:33 -0500
 From: Jeff j...@usedmoviefinder.com
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Power failure recovery woes

 Some additional information/questions:

 Here is the output of ceph osd tree

 Some of the down OSD's are actually running, but are down. For example
 osd.1:

 root 30158  8.6 12.7 1542860 781288 ?  Ssl 07:47   4:40
 /usr/bin/ceph-osd --cluster=ceph -i 0 -f

  Is there any way to get the cluster to recognize them as being up?  osd-1 has
 the FAILED assert(last_e.version.version < e.version.version) errors.

 Thanks,
  Jeff


 # id    weight  type name   up/down reweight
 -1  10.22   root default
 -2  2.72host ceph1
 0   0.91osd.0   up  1
 1   0.91osd.1   down0
 2   0.9 osd.2   down0
 -3  1.82host ceph2
 3   0.91osd.3   down0
 4   0.91osd.4   down0
 -4  2.04host ceph3
 5   0.68osd.5   up  1
 6   0.68osd.6   up  1
 7   0.68osd.7   up  1
 8   0.68osd.8   down0
 -5  1.82host ceph4
 9   0.91osd.9   up  1
 10  0.91osd.10  down0
 -6  1.82host ceph5
 11  0.91osd.11  up  1
 12  0.91osd.12  up  1

 On 2/17/2015 8:28 AM, Jeff wrote:


  Original Message 
 Subject: Re: [ceph-users] Power failure recovery woes
 Date: 2015-02-17 04:23
 From: Udo Lembke ulem...@polarzone.de
 To: Jeff j...@usedmoviefinder.com, ceph-users@lists.ceph.com

 Hi Jeff,
 is the osd /var/lib/ceph/osd/ceph-2 mounted?

 If not, does it helps, if you mounted the osd and start with
 service ceph start osd.2
 ??

 Udo

 Am 17.02.2015 09:54, schrieb Jeff:
 Hi,

 We had a nasty power failure yesterday and even with UPS's our small (5
 node, 12 OSD) cluster is having problems recovering.

 We are running ceph 0.87

 3 of our OSD's are down consistently (others stop and are restartable,
 but our cluster is so slow that almost everything we do times out).

 We are seeing errors like this on the OSD's that never run:

 ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1)
 Operation not permitted

 We are seeing errors like these of the OSD's that run some of the time:

 osd/PGLog.cc: 844: FAILED assert(last_e.version.version <
 e.version.version)
 common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide
 timeout")

 Does anyone have any suggestions on how to recover our cluster?

 Thanks!
   Jeff


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 - End forwarded message -

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] running giant/hammer mds with firefly osds

2015-02-20 Thread Gregory Farnum
On Fri, Feb 20, 2015 at 3:50 AM, Luis Periquito periqu...@gmail.com wrote:
 Hi Dan,

 I remember http://tracker.ceph.com/issues/9945 introducing some issues with
 running cephfs between different versions of giant/firefly.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14257.html

Hmm, yeah, that's been fixed for a while but is still waiting to go
out in the next point release. :(

Beyond this bug, although the MDS doesn't have any new OSD
dependencies that could break things, we don't test cross-version
stuff like that at all except during upgrades. Some minimal testing on
your side should be enough to make sure it works, but if I were you
I'd try it on a test cluster first — the MDS is reporting a lot more
to the monitors in Giant and Hammer than it did in Firefly, and
everything should be good but there might be issues lurking in the
compatibility checks there.
-Greg


 So if you upgrade please be aware that you'll also have to update the
 clients.

 On Fri, Feb 20, 2015 at 10:33 AM, Dan van der Ster d...@vanderster.com
 wrote:

 Hi all,

 Back in the dumpling days, we were able to run the emperor MDS with
 dumpling OSDs -- this was an improvement over the dumpling MDS.

 Now we have stable firefly OSDs, but I was wondering if we can reap
 some of the recent CephFS developments by running a giant or ~hammer
 MDS with our firefly OSDs. Did anyone try that yet?

 Best Regards, Dan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mixed ceph versions

2015-02-25 Thread Gregory Farnum
On Wed, Feb 25, 2015 at 3:11 PM, Deneau, Tom tom.den...@amd.com wrote:
 I need to set up a cluster where the rados client (for running rados
 bench) may be on a different architecture and hence running a different
 ceph version from the osd/mon nodes.  Is there a list of which ceph
 versions work together for a situation like this?

The RADOS protocol is architecture-independent, and while we don't
test across a huge version divergence (mostly between LTS releases)
the client should also be compatible with pretty much anything you
have server-side.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-24 Thread Gregory Farnum
On Mon, Feb 23, 2015 at 8:59 AM, Chris Murray chrismurra...@gmail.com wrote:
 ... Trying to send again after reporting bounce backs to dreamhost ...
 ... Trying to send one more time after seeing mails come through the
 list today ...

 Hi all,

 First off, I should point out that this is a 'small cluster' issue and
 may well be due to the stretched resources. If I'm doomed to destroying
 and starting again, fair be it, but I'm interested to see if things can
 get up and running again.

 My experimental ceph cluster now has 5 nodes with 3 osds each. Some
 drives are big, some drives are small. Most are formatted with BTRFS and
 two are still formatted with XFS, which I intend to remove and recreate
 with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but
 compression suits my use-case, so I'm prepared to stick with it while it
 matures. I had to set the following, to avoid osds dying as the IO was
 consumed by the snapshot creation and deletion process (as I understand
 it):

 filestore btrfs snap = false

 and the mount options look like this:

 osd mount options btrfs =
 rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo

 Each node is a HP Microserver n36l or n54l, with 8GB of memory, so CPU
 horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is
 also a mon.

 My issue is: After adding the 15th osd, the cluster went into a spiral
 of destruction, with osds going down one after another. One might go
 down on occasion, and usually a start of the osd in question will remedy
 things. This time, though, it hasn't, and the problem appears to have
 become worse and worse. I've tried starting osds, restarting whole
 hosts, to no avail. I've brought all osds back 'in' and set noup, nodown
 and noout. I've ceased rbd activity since it was getting blocked anyway.
 The cluster appears to now be 'stuck' in this state:

 cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
  health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969
 pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445
 pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck
 unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full
 osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
  monmap e5: 5 mons at
 {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=
 192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618,
 quorum 0,1,2,3,4 0,1,2,3,4
  osdmap e63276: 15 osds: 7 up, 15 in
 flags noup,nodown,noout
   pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
 4611 GB used, 871 GB / 5563 GB avail
 838948/2578420 objects degraded (32.537%)
3 down+remapped+peering
8 stale+active+degraded+remapped
   85 active+clean
1 stale+incomplete
 1088 stale+down+peering
  642 active+degraded+remapped
1 incomplete
   33 stale+remapped+peering
  135 down+peering
1 stale+degraded
1
 stale+active+degraded+remapped+wait_backfill+backfill_toofull
  854 active+remapped
  234 stale+active+degraded
4 active+degraded+remapped+backfill_toofull
   40 active+remapped+backfill_toofull
 1079 active+degraded
5 stale+active+clean
   74 stale+peering

 Take one of the nodes. It holds osds 12 (down  in), 13 (up  in) and 14
 (down  in).

 # ceph osd stat
  osdmap e63276: 15 osds: 7 up, 15 in
 flags noup,nodown,noout

 # ceph daemon osd.12 status
 no valid command found; 10 closest matches:
 config show
 help
 log dump
 get_command_descriptions
 git_version
 config set var val [val...]
 version
 2
 config get var
 0
 admin_socket: invalid command

 # ceph daemon osd.13 status
 { "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
   "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
   "whoami": 13,
   "state": "active",
   "oldest_map": 48214,
   "newest_map": 63276,
   "num_pgs": 790}

 # ceph daemon osd.14 status
 admin_socket: exception getting command descriptions: [Errno 111]
 Connection refused

 I'm assuming osds 12 and 14 are acting that way because they're not up,
 but why are they different?

Well, you below indicate that osd.14's log says it crashed on an
internal heartbeat timeout (usually, it got stuck waiting for disk IO
or the kernel/btrfs hung), so that's why. The osd.12 process exists
but isn't up; osd.14 doesn't even have a socket to connect to.


 In terms of logs, ceph-osd.12.log doesn't go beyond this:
 2015-02-22 10:38:29.629407 7fd24952c780  0 ceph version 0.80.8
 (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
 2015-02-22 10:38:29.639802 7fd24952c780  0
 filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
 

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-26 Thread Gregory Farnum
 comment on what might be causing this error for this osd?
 Many years ago, when ZFS was in its infancy, I had a dedup disaster
 which I thought would never end, but that just needed to do its thing
 before the pool came back to life. Could this be a similar scenario
 perhaps? Is the activity leading up to something, and BTRFS is slowly
 doing what Ceph is asking of it, or is it just going round and round in
 circles and I just can't see? :-)

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Chris Murray
 Sent: 25 February 2015 12:58
 To: Gregory Farnum
 Cc: ceph-users
 Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will
 the cluster recover without help?

 Thanks Greg

 After seeing some recommendations I found in another thread, my
 impatience got the better of me, and I've start the process again, but
 there is some logic, I promise :-) I've copied the process from Michael
 Kidd, I believe, and it goes along the lines of:

 setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
 stopping all OSDs
 setting all OSDs down & out
 setting various options in ceph.conf to limit backfill activity etc
 starting all OSDs
 wait until all CPU settles to 0%  -- I am here
 unset the noup flag
 wait until all CPU settles to 0%
 unset the noin flag
 wait until all CPU settles to 0%
 unset the nobackfill flag
 wait until all CPU settles to 0%
 unset the norecover flag
 remove options from ceph.conf
 unset the noscrub flag
 unset the nodeep-scrub flag
 (a sketch of the matching commands follows just below)
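
 The flag steps above map onto commands along these lines (a sketch; the
 ceph.conf backfill/recovery options are deliberately left out here since the
 exact ones used aren't shown):

 ceph osd set noup
 ceph osd set noin
 ceph osd set noscrub
 ceph osd set nodeep-scrub
 ceph osd set norecover
 ceph osd set nobackfill
 # ... stop daemons, mark them down/out, adjust ceph.conf, start daemons ...
 # then, one flag at a time as the CPU settles:
 ceph osd unset noup
 ceph osd unset noin
 ceph osd unset nobackfill
 ceph osd unset norecover
 ceph osd unset noscrub
 ceph osd unset nodeep-scrub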


 Currently, host CPU usage is approx the following, so something's
 changed, and I'm tempted to leave things a little longer before my next
 step, just in case CPU does eventually stop spinning. I read reports of
 things taking a while even with modern Xeons, so I suppose it's not
 outside the realms of possibility that an AMD Neo might take days to
 work things out. We're up to 24.5 hours now:

 192.168.12.25   20%
 192.168.12.26   1%
 192.168.12.27   15%
 192.168.12.28   1%
 192.168.12.29   12%

 Interesting, as 192.168.12.26 and .28 are the two which stopped spinning
 before I restarted this process too.

 The number of different states is slightly less confusing now, but not
 by much: :-)

 788386/2591752 objects degraded (30.419%)
   90 stale+active+clean
2 stale+down+remapped+peering
2 stale+incomplete
1
 stale+active+degraded+remapped+wait_backfill+backfill_toofull
1 stale+degraded
 1255 stale+active+degraded
   32 stale+remapped+peering
  773 stale+active+remapped
4 stale+active+degraded+remapped+backfill_toofull
 1254 stale+down+peering
  278 stale+peering
   33 stale+active+remapped+backfill_toofull
  563 stale+active+degraded+remapped

 Well, you below indicate that osd.14's log says it crashed on an
 internal heartbeat timeout (usually, it got stuck waiting for disk IO or
 the kernel/btrfs hung), so that's why. The osd.12 process exists but
 isn't up; osd.14 doesn't even have a socket to connect to.

 Ah, that does make sense, thank you.

 That's not what I'd expect to see (it appears to have timed out and
 not be recognizing it?) but I don't look at these things too often so
 maybe that's the normal indication that heartbeats are failing.

 I'm not sure what this means either. A google for "heartbeat_map
 is_healthy 'FileStore::op_tp thread' had timed out after" doesn't return
 much, but I did see this quote from Sage on what looks like a similar
 matter:

 - the filestore op_queue is blocked on the throttler (too much io
 queued)
 - the commit thread is also waiting for ops to finish
 - i see no actual thread processing the op_queue
 Usually that's because it hit a kernel bug and got killed.  Not sure what else would
 make that thread disappear...
 sage

 Oh!

 Also, you want to find out why they're dying. That's probably the root

 cause of your issues

 I have a sneaking suspicion it's BTRFS, but don't have the evidence or
 perhaps the knowledge to prove it. If XFS did compression, I'd go with
 that, but at the moment I need to rely on compression to solve the
 problem of reclaiming space *within* files which reside on ceph. As far
 as I remember, overwriting with zeros didn't re-do the thin provisioning
 on XFS, if that makes sense.

 Thanks again,
 Chris
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph -s slow return result

2015-03-27 Thread Gregory Farnum
Are all your monitors running? Usually a temporary hang means that the Ceph
client tries to reach a monitor that isn't up, then times out and contacts
a different one.

I have also seen it just be slow if the monitors are processing so many
updates that they're behind, but that's usually on a very unhappy cluster.
-Greg
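A couple of quick checks along those lines (a sketch; substitute your own
monitor addresses):

# is every monitor in quorum?
ceph quorum_status
# point the client at one specific monitor at a time to find a slow or dead one
ceph -s -m 192.168.0.11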
On Fri, Mar 27, 2015 at 8:50 AM Chu Duc Minh chu.ducm...@gmail.com wrote:

 On my CEPH cluster, ceph -s returns its result quite slowly.
 Sometimes it returns the result immediately, sometimes it hangs a few seconds
 before returning the result.

 Do you think this problem (ceph -s slow return) only relates to the ceph-mon(s)
 process? Or might it relate to the ceph-osd(s) too?
 (I am deleting a big bucket, .rgw.buckets, and the ceph-osd(s) disk util is quite
 high.)

 Regards,
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Slow writes with 1MB files

2015-03-27 Thread Gregory Farnum
So this is exactly the same test you ran previously, but now it's on
faster hardware and the test is slower?

Do you have more data in the test cluster? One obvious possibility is
that previously you were working entirely in the MDS' cache, but now
you've got more dentries and so it's kicking data out to RADOS and
then reading it back in.

If you've got the memory (you appear to) you can pump up the mds
cache size config option quite dramatically from its default of 100000.
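
For example (a sketch; the value is arbitrary, and every cached inode costs
memory, so size it to the RAM you actually have):

# in ceph.conf on the MDS node
[mds]
mds cache size = 4000000

# or at runtime via the MDS admin socket (adjust the daemon name)
ceph daemon mds.$(hostname -s) config set mds_cache_size 4000000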

Other things to check are that you've got an appropriately-sized
metadata pool, that you've not got clients competing against each
other inappropriately, etc.
-Greg

On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
almightybe...@gmail.com wrote:
 Oops, I should have said that I am not just writing the data but copying it:

 time cp Small1/* Small2/*

 Thanks,

 BJ

 On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 I did a Ceph cluster install 2 weeks ago where I was getting great
 performance (~= PanFS) where I could write 100,000 1MB files in 61
 Mins (Took PanFS 59 Mins). I thought I could increase the performance
 by adding a better MDS server so I redid the entire build.

 Now it takes 4 times as long to write the same data as it did before.
 The only thing that changed was the MDS server. (I even tried moving
 the MDS back on the old slower node and the performance was the same.)

 The first install was on CentOS 7. I tried going down to CentOS 6.6
 and it's the same results.
 I use the same scripts to install the OSDs (which I created because I
 can never get ceph-deploy to behave correctly. Although, I did use
 ceph-deploy to create the MDS and MON and initial cluster creation.)

 I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
 with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
 -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)

 Could anybody think of a reason as to why I am now getting a huge regression.

 Hardware Setup:
 [OSDs]
 64 GB 2133 MHz
 Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
 40Gb Mellanox NIC

 [MDS/MON new]
 128 GB 2133 MHz
 Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
 40Gb Mellanox NIC

 [MDS/MON old]
 32 GB 800 MHz
 Dual Proc E5472  @ 3.00GHz (8 Cores)
 10Gb Intel NIC
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshots and fstrim with cache tiers ?

2015-03-27 Thread Gregory Farnum
On Wed, Mar 25, 2015 at 3:14 AM, Frédéric Nass
frederic.n...@univ-lorraine.fr wrote:
 Hello,


 I have a few questions regarding snapshots and fstrim with cache tiers.


 In the cache tier and erasure coding FAQ related to ICE 1.2 (based on
 Firefly), Inktank says "Snapshots are not supported in conjunction with
 cache tiers."

 What are the risks of using snapshots with cache tiers? Would this "better
 not use it" recommendation still be true with Giant or Hammer?


 Regarding the fstrim command, it doesn't seem to work with cache tiers. The
 freed up blocks don't get back in the ceph cluster.
 Can someone confirm this ? Is there something we can do to get those freed
 up blocks back in the cluster ?

It does work, but there are two effects you're missing here:
1) The object can be deleted in the cache tier, but it won't get
deleted from the backing pool until it gets flushed out of the cache
pool. Depending on your workload this can take a while.
2) On erasure-coded pool, the OSD makes sure it can roll back a
certain number of operations per PG. In the case of deletions, this
means keeping the object data around for a while. This can also take a
while if you're not doing many operations. This has been discussed on
the list before; I think you'll want to look for a thread about
rollback and pg log size.
-Greg
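
If you want the deletions to reach the base pool sooner than normal flushing
would get to them, something like this should push everything out (a sketch;
it will churn the cache tier, so pick a quiet moment):

# flush and evict everything from the cache tier into the backing pool
rados -p <cache-pool-name> cache-flush-evict-all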



 Also, can we run an fstrim task from the cluster side ? That is, without
 having to map and mount each rbd image or rely on the client to operate this
 task ?


 Best regards,


 --

 Frédéric Nass

 Sous-direction Infrastructures
 Direction du Numérique
 Université de Lorraine

 email : frederic.n...@univ-lorraine.fr
 Tél : +33 3 83 68 53 83

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Slow writes with 1MB files

2015-03-27 Thread Gregory Farnum
On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
almightybe...@gmail.com wrote:
 Yes it's the exact same hardware except for the MDS server (although I
 tried using the MDS on the old node).
 I have not tried moving the MON back to the old node.

 My default cache size is mds cache size = 1000
 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
 I created 2048 for data and metadata:
 ceph osd pool create cephfs_data 2048 2048
 ceph osd pool create cephfs_metadata 2048 2048


 To your point on clients competing against each other... how would I check 
 that?

Do you have multiple clients mounted? Are they both accessing files in
the directory(ies) you're testing? Were they accessing the same
pattern of files for the old cluster?

If you happen to be running a hammer rc or something pretty new you
can use the MDS admin socket to explore a bit what client sessions
there are and what they have permissions on and check; otherwise
you'll have to figure it out from the client side.
-Greg
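
On a new enough MDS, the session check looks something like this (a sketch;
the exact daemon name depends on how your MDS is registered):

# on the MDS host
ceph daemon mds.$(hostname -s) session ls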


 Thanks for the input!


 On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote:
 So this is exactly the same test you ran previously, but now it's on
 faster hardware and the test is slower?

 Do you have more data in the test cluster? One obvious possibility is
 that previously you were working entirely in the MDS' cache, but now
 you've got more dentries and so it's kicking data out to RADOS and
 then reading it back in.

 If you've got the memory (you appear to) you can pump up the mds
  cache size config option quite dramatically from its default of 100000.

 Other things to check are that you've got an appropriately-sized
 metadata pool, that you've not got clients competing against each
 other inappropriately, etc.
 -Greg

 On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Opps I should have said that I am not just writing the data but copying it :

 time cp Small1/* Small2/*

 Thanks,

 BJ

 On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 I did a Ceph cluster install 2 weeks ago where I was getting great
 performance (~= PanFS) where I could write 100,000 1MB files in 61
 Mins (Took PanFS 59 Mins). I thought I could increase the performance
 by adding a better MDS server so I redid the entire build.

 Now it takes 4 times as long to write the same data as it did before.
 The only thing that changed was the MDS server. (I even tried moving
 the MDS back on the old slower node and the performance was the same.)

 The first install was on CentOS 7. I tried going down to CentOS 6.6
 and it's the same results.
 I use the same scripts to install the OSDs (which I created because I
 can never get ceph-deploy to behave correctly. Although, I did use
 ceph-deploy to create the MDS and MON and initial cluster creation.)

 I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
 with rados bench -p cephfs_data 500 write --no-cleanup  rados bench
 -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)

 Could anybody think of a reason as to why I am now getting a huge 
 regression.

 Hardware Setup:
 [OSDs]
 64 GB 2133 MHz
 Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
 40Gb Mellanox NIC

 [MDS/MON new]
 128 GB 2133 MHz
 Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
 40Gb Mellanox NIC

 [MDS/MON old]
 32 GB 800 MHz
 Dual Proc E5472  @ 3.00GHz (8 Cores)
 10Gb Intel NIC
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All client writes block when 2 of 3 OSDs down

2015-03-26 Thread Gregory Farnum
Has the OSD actually been detected as down yet?

You'll also need to set that min size on your existing pools ("ceph
osd pool <pool> set min_size 1" or similar) to change their behavior;
the config option only takes effect for newly-created pools. (Thus the
default.)
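
A quick way to apply and verify that across every existing pool (a sketch):

for p in $(rados lspools); do ceph osd pool set "$p" min_size 1; done
# confirm the per-pool values took effect
ceph osd dump | grep min_size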

On Thu, Mar 26, 2015 at 1:29 PM, Lee Revell rlrev...@gmail.com wrote:
 I added the osd pool default min size = 1 to test the behavior when 2 of 3
 OSDs are down, but the behavior is exactly the same as without it: when the
 2nd OSD is killed, all client writes start to block and these
 pipe.(stuff).fault messages begin:

 2015-03-26 16:08:50.775848 7fce177fe700  0 monclient: hunting for new mon
 2015-03-26 16:08:53.781133 7fce1c2f9700  0 -- 192.168.122.111:0/1011003 >> 192.168.122.131:6789/0 pipe(0x7fce0c01d260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce0c01d4f0).fault
 2015-03-26 16:09:00.009092 7fce1c3fa700  0 -- 192.168.122.111:0/1011003 >> 192.168.122.141:6789/0 pipe(0x7fce1802dab0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802dd40).fault
 2015-03-26 16:09:12.013147 7fce1c2f9700  0 -- 192.168.122.111:0/1011003 >> 192.168.122.131:6789/0 pipe(0x7fce1802e740 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802e9d0).fault
 2015-03-26 16:10:06.013113 7fce1c2f9700  0 -- 192.168.122.111:0/1011003 >> 192.168.122.131:6789/0 pipe(0x7fce1802df80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1801e600).fault
 2015-03-26 16:10:36.013166 7fce1c3fa700  0 -- 192.168.122.111:0/1011003 >> 192.168.122.141:6789/0 pipe(0x7fce1802ebc0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802ee50).fault

 Here is my ceph.conf:

 [global]
 fsid = db460aa2-5129-4aaa-8b2e-43eac727124e
 mon_initial_members = ceph-node-1
 mon_host = 192.168.122.121
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 osd pool default size = 3
 osd pool default min size = 1
 public network = 192.168.122.0/24


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)

2015-03-26 Thread Gregory Farnum
There have been bugs here in the recent past which have been fixed for
hammer, at least...it's possible we didn't backport it for the giant
point release. :(

But for users going forward that procedure should be good!
-Greg

On Thu, Mar 26, 2015 at 11:26 AM, Kyle Hutson kylehut...@ksu.edu wrote:
 For what it's worth, I don't think  being patient was the answer. I was
 having the same problem a couple of weeks ago, and I waited from before 5pm
 one day until after 8am the next, and still got the same errors. I ended up
 adding a new cephfs pool with a newly-created small pool, but was never
 able to actually remove cephfs altogether.

 On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett j...@mrc-lmb.cam.ac.uk
 wrote:

 On 03/25/2015 05:44 PM, Gregory Farnum wrote:

 On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk
 wrote:

 Dear All,

 Please forgive this post if it's naive, I'm trying to familiarise myself
 with cephfs!

 I'm using Scientific Linux 6.6. with Ceph 0.87.1

 My first steps with cephfs using a replicated pool worked OK.

 Now trying now to test cephfs via a replicated caching tier on top of an
 erasure pool. I've created an erasure pool, cannot put it under the
 existing
 replicated pool.

 My thoughts were to delete the existing cephfs, and start again, however
 I
 cannot delete the existing cephfs:

 errors are as follows:

 [root@ceph1 ~]# ceph fs rm cephfs2
 Error EINVAL: all MDS daemons must be inactive before removing
 filesystem

 I've tried killing the ceph-mds process, but this does not prevent the
 above
 error.

 I've also tried this, which also errors:

 [root@ceph1 ~]# ceph mds stop 0
 Error EBUSY: must decrease max_mds or else MDS will immediately
 reactivate


 Right, so did you run "ceph mds set_max_mds 0" and then repeat the
 stop command? :)


 This also fail...

 [root@ceph1 ~]# ceph-deploy mds destroy
 [ceph_deploy.conf][DEBUG ] found configuration file at:
 /root/.cephdeploy.conf
 [ceph_deploy.cli][INFO  ] Invoked (1.5.21): /usr/bin/ceph-deploy mds
 destroy
 [ceph_deploy.mds][ERROR ] subcommand destroy not implemented

 Am I doing the right thing in trying to wipe the original cephfs config
 before attempting to use an erasure cold tier? Or can I just redefine
 the
 cephfs?


 Yeah, unfortunately you need to recreate it if you want to try and use
 an EC pool with cache tiering, because CephFS knows what pools it
 expects data to belong to. Things are unlikely to behave correctly if
 you try and stick an EC pool under an existing one. :(

 Sounds like this is all just testing, which is good because the
 suitability of EC+cache is very dependent on how much hot data you
 have, etc...good luck!
 -Greg


 many thanks,

 Jake Grimmett
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 Thanks for your help - much appreciated.

 The set_max_mds 0 command worked, but only after I rebooted the server,
 and restarted ceph twice. Before this I still got an
 mds active error, and so was unable to destroy the cephfs.

 Possibly I was being impatient and needed to let the mds go inactive? There
 were ~1 million files on the system.

 [root@ceph1 ~]# ceph mds set_max_mds 0
 max_mds = 0

 [root@ceph1 ~]# ceph mds stop 0
 telling mds.0 10.1.0.86:6811/3249 to deactivate

 [root@ceph1 ~]# ceph mds stop 0
 Error EEXIST: mds.0 not active (up:stopping)

 [root@ceph1 ~]# ceph fs rm cephfs2
 Error EINVAL: all MDS daemons must be inactive before removing filesystem

 There shouldn't be any other mds servers running..
 [root@ceph1 ~]# ceph mds stop 1
 Error EEXIST: mds.1 not active (down:dne)

 At this point I rebooted the server and did a service ceph restart twice.
 I shut down ceph, then restarted ceph before this command worked:

 [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it

 Anyhow, I've now been able to create an erasure coded pool, with a
 replicated tier which cephfs is running on :)

 *Lots* of testing to go!

 Again, many thanks

 Jake

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Gregory Farnum
You shouldn't rely on rados ls when working with cache pools. It
doesn't behave properly and is a silly operation to run against a pool
of any size even when it does. :)

More specifically, rados ls is invoking the pgls operation. Normal
read/write ops will go query the backing store for objects if they're
not in the cache tier. pgls is different — it just tells you what
objects are present in the PG on that OSD right now. So any objects
which aren't in cache won't show up when listing on the cache pool.
-Greg
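
If the goal is just to confirm that particular objects are reachable through
the new cache tier, address them directly instead of listing, e.g. with one of
the names from the rados ls output below (a rough sketch, not verified on this
cluster):

$ rados -p ssd-archiv stat rbd_data.2e47de674b0dc51.00390074
$ rados -p ssd-archiv get rbd_data.2e47de674b0dc51.00390074 /tmp/obj

A direct read or stat like that should be served (and promoted) from the
backing EC pool if the object isn't in the cache yet, whereas pgls only ever
reports what already sits in the cache PGs.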

On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi all,
 due to a very silly approach, I removed the cache tier of a filled EC pool.

 After recreating the pool and connecting it with the EC pool I don't see any content.
 How can I see the rbd_data and other files through the new ssd cache tier?

 I think that I must recreate the rbd_directory (and fill it with setomapval),
 but I don't see anything yet!

 $ rados ls -p ecarchiv | more
 rbd_data.2e47de674b0dc51.00390074
 rbd_data.2e47de674b0dc51.0020b64f
 rbd_data.2fbb1952ae8944a.0016184c
 rbd_data.2cfc7ce74b0dc51.00363527
 rbd_data.2cfc7ce74b0dc51.0004c35f
 rbd_data.2fbb1952ae8944a.0008db43
 rbd_data.2cfc7ce74b0dc51.0015895a
 rbd_data.31229f0238e1f29.000135eb
 ...

 $ rados ls -p ssd-archiv
  nothing 

 generation of the cache tier:
 $ rados mkpool ssd-archiv
 $ ceph osd pool set ssd-archiv crush_ruleset 5
 $ ceph osd tier add ecarchiv ssd-archiv
 $ ceph osd tier cache-mode ssd-archiv writeback
 $ ceph osd pool set ssd-archiv hit_set_type bloom
 $ ceph osd pool set ssd-archiv hit_set_count 1
 $ ceph osd pool set ssd-archiv hit_set_period 3600
 $ ceph osd pool set ssd-archiv target_max_bytes 500


 rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
 }


 Is there any magic (or which command have I missed?) to see the existing data
 through the cache tier?


 regards - and hoping for answers

 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating objects from one pool to another?

2015-03-26 Thread Gregory Farnum
On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote:

 On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote:

 That's a great idea. I know I can setup cinder (the openstack volume 
 manager) as a multi-backend manager and migrate from one backend to the 
 other, each backend linking to different pools of the same ceph cluster. 
 What bugs me though is that I'm pretty sure the image store, glance, 
 wouldn't let me do that. Additionally, since the compute component also has 
 its own ceph pool, I'm pretty sure it won't let me migrate the data through 
 openstack.
 Hm wouldn’t it be possible to do something similar ala:

 # list object from src pool
 rados ls objects loop | filter-obj-id | while read obj; do
  # export $obj to local disk
  rados -p pool-wth-too-many-pgs get $obj
  # import $obj from local disk to new pool
  rados -p better-sized-pool put $obj
 done

You would also have issues with snapshots if you do this on an RBD
pool. That's unfortunately not feasible.
-Greg
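
For data where losing snapshots is acceptable, a runnable version of the loop
sketched above would look roughly like this (pool names are placeholders, and
only the head objects get copied):

$ rados -p pool-with-too-many-pgs ls > objlist
$ while read -r obj; do
      rados -p pool-with-too-many-pgs get "$obj" /tmp/obj.tmp &&
      rados -p better-sized-pool put "$obj" /tmp/obj.tmp
  done < objlist

For RBD pools, though, that loses the snapshot metadata and image
relationships discussed above, so it isn't a real migration.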



 possible split/partition list of objects into multiple concurrent loops, 
 possible from multiple boxes as seems fit for resources at hand, cpu, memory, 
 network, ceph perf.

 /Steffen




 On 3/26/2015 3:54 PM, Steffen W Sørensen wrote:
 On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote:

 Lately I've been going back to work on one of my first ceph setups and now
 I see that I have created way too many placement groups for the pools on
 that setup (about 10 000 too many). I believe this may impact performance
 negatively, as the performance on this ceph cluster is abysmal. Since it
 is not possible to reduce the number of PGs in a pool, I was thinking of 
 creating new pools with a smaller number of PGs, moving the data from the 
 old pools to the new pools and then deleting the old pools.

 I haven't seen any command to copy objects from one pool to another. Would 
 that be possible? I'm using ceph for block storage with openstack, so 
 surely there must be a way to move block devices from a pool to another, 
 right?
 What I did at one point was going one layer higher in my storage
 abstraction, and created new Ceph pools and used those for new storage 
 resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a 
 live migration of virtual disks there, assume you could do the same in 
 OpenStack.

 My 0.02$

 /Steffen


 --
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator
 GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating objects from one pool to another?

2015-03-26 Thread Gregory Farnum
The procedure you've outlined won't copy snapshots, just the head
objects. Preserving the proper snapshot metadata and inter-pool
relationships on rbd images I think isn't actually possible when
trying to change pools.

On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote:

 On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote:


 On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote:

 That's a great idea. I know I can setup cinder (the openstack volume
 manager) as a multi-backend manager and migrate from one backend to the
 other, each backend linking to different pools of the same ceph cluster.
 What bugs me though is that I'm pretty sure the image store, glance,
 wouldn't let me do that. Additionally, since the compute component also has
 its own ceph pool, I'm pretty sure it won't let me migrate the data through
 openstack.

 Hm wouldn’t it be possible to do something similar ala:

 # list object from src pool
 rados ls objects loop | filter-obj-id | while read obj; do
 # export $obj to local disk
 rados -p pool-wth-too-many-pgs get $obj
 # import $obj from local disk to new pool
 rados -p better-sized-pool put $obj
 done


 You would also have issues with snapshots if you do this on an RBD
 pool. That's unfortunately not feasible.

 What isn’t possible: exporting/importing objects out of and into pools, or the
 snapshot issues?

 /Steffen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All client writes block when 2 of 3 OSDs down

2015-03-26 Thread Gregory Farnum
On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote:
 On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote:

 Has the OSD actually been detected as down yet?


 I believe it has, however I can't directly check because ceph health
 starts to hang when I down the second node.

Oh. You need to keep a quorum of your monitors running (just the
monitor processes, not of everything in the system) or nothing at all
is going to work. That's how we prevent split brain issues.



 You'll also need to set that min size on your existing pools (ceph
 osd pool set <pool> min_size 1 or similar) to change their behavior;
 the config option only takes effect for newly-created pools. (Thus the
 default.)


 I've done this, however the behavior is the same:

 $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd
 pool set $f min_size 1; done
 set pool 0 min_size to 1
 set pool 1 min_size to 1
 set pool 2 min_size to 1
 set pool 3 min_size to 1
 set pool 4 min_size to 1
 set pool 5 min_size to 1
 set pool 6 min_size to 1
 set pool 7 min_size to 1

 $ ceph -w
 cluster db460aa2-5129-4aaa-8b2e-43eac727124e
  health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2
  monmap e3: 3 mons at
 {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0},
 election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2
  mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active}
  osdmap e362: 3 osds: 2 up, 2 in
   pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects
 25329 MB used, 12649 MB / 40059 MB avail
  840 active+clean

 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840
 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail
 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840
 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s
 rd, 260 kB/s wr, 13 op/s
 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840
 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s
 rd, 943 kB/s wr, 38 op/s
 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840
 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s
 rd, 10699 kB/s wr, 621 op/s

 this is where i kill the second OSD

 2015-03-26 17:26:26.778461 7f4ebeffd700  0 monclient: hunting for new mon
 2015-03-26 17:26:30.701099 7f4ec45f5700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault
 2015-03-26 17:26:42.701154 7f4ec44f4700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault

 And all writes block until I bring back an OSD.

 Lee
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All client writes block when 2 of 3 OSDs down

2015-03-26 Thread Gregory Farnum
On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote:
 Greg,
 Couple of dumb question may be.

 1. If you see , the clients are connecting fine with two monitors in the 
 cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 
 monitor  (which is I guess happening after making 2 nodes down) it is not 
 able to connect ?

A quorum is a strict majority of the total membership. 2 monitors can
form a quorum just fine if there are either 2 or 3 total membership.
(As long as those two agree on every action, it cannot be lost.)

We don't *recommend* configuring systems with an even number of
monitors, because it increases the number of total possible failures
without increasing the number of failures that can be tolerated. (3
monitors need 2 in quorum and tolerate 1 failure; 4 need 3 in quorum and
still tolerate only 1. Likewise for 5 and 6, 7 and 8, etc.)
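
The arithmetic is easy to check: the quorum size is a strict majority, i.e.
floor(N/2)+1, and the number of failures tolerated is N minus that. For
example:

$ for n in 3 4 5 6 7 8; do q=$(( n/2 + 1 )); echo "$n mons: quorum=$q, tolerates $(( n - q ))"; done
3 mons: quorum=2, tolerates 1
4 mons: quorum=3, tolerates 1
5 mons: quorum=3, tolerates 2
6 mons: quorum=4, tolerates 2
7 mons: quorum=4, tolerates 3
8 mons: quorum=5, tolerates 3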


 2. Also, my understanding is while IO is going on *no* monitor interaction 
 will be on that path, so, why the client io will be stopped because the 
 monitor quorum is not there ? If the min_size =1 is properly set it should 
 able to serve IO as long as 1 OSD (node) is up, isn't it ?

Well, the remaining OSD won't be able to process IO because it's lost
its peers, and it can't reach any monitors to do updates or get new
maps. (Monitors which are not in quorum will not allow clients to
connect.)
The clients will eventually stop serving IO if they know they can't
reach a monitor, although I don't remember exactly how that's
triggered.

In this particular case, though, the client probably just tried to do
an op against the dead osd, realized it couldn't, and tried to fetch a
map from the monitors. When that failed it went into search mode,
which is what the logs are showing you.
-Greg


 Thanks  Regards
 Somnath

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
 Gregory Farnum
 Sent: Thursday, March 26, 2015 2:40 PM
 To: Lee Revell
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down

 On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote:
 On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote:

 Has the OSD actually been detected as down yet?


 I believe it has, however I can't directly check because ceph health
 starts to hang when I down the second node.

 Oh. You need to keep a quorum of your monitors running (just the monitor 
 processes, not of everything in the system) or nothing at all is going to 
 work. That's how we prevent split brain issues.



 You'll also need to set that min size on your existing pools (ceph
 osd pool set <pool> min_size 1 or similar) to change their behavior;
 the config option only takes effect for newly-created pools. (Thus
 the
 default.)


 I've done this, however the behavior is the same:

 $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do
 ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set
 pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1
 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size
 to 1 set pool 7 min_size to 1

 $ ceph -w
 cluster db460aa2-5129-4aaa-8b2e-43eac727124e
  health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2
  monmap e3: 3 mons at
 {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0
 ,ceph-node-3=192.168.122.141:6789/0},
 election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2
  mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active}
  osdmap e362: 3 osds: 2 up, 2 in
   pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects
 25329 MB used, 12649 MB / 40059 MB avail
  840 active+clean

 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840
 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail
 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840
 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s
 rd, 260 kB/s wr, 13 op/s
 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840
 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s
 rd, 943 kB/s wr, 38 op/s
 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840
 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s
 rd, 10699 kB/s wr, 621 op/s

 this is where i kill the second OSD

 2015-03-26 17:26:26.778461 7f4ebeffd700  0 monclient: hunting for new mon
 2015-03-26 17:26:30.701099 7f4ec45f5700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault
 2015-03-26 17:26:42.701154 7f4ec44f4700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault

 And all writes block until I bring back an OSD.

 Lee
 ___
 ceph-users mailing list
 ceph

Re: [ceph-users] All client writes block when 2 of 3 OSDs down

2015-03-26 Thread Gregory Farnum
On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote:
 Got most portion of it, thanks !
 But, still not able to get when second node is down why with single monitor 
 in the cluster client is not able to connect ?
 1 monitor can form a quorum and should be sufficient for a cluster to run.

The whole point of the monitor cluster is to ensure a globally
consistent view of the cluster state that will never be reversed by a
different group of up nodes. If one monitor (out of three) could make
changes to the maps by itself, then there's nothing to prevent all
three monitors from staying up but getting a net split, and then each
issuing different versions of the osdmaps to whichever clients or OSDs
happen to be connected to them.

If you want to get down into the math proofs and things then the Paxos
papers do all the proofs. Or you can look at the CAP theorem about the
tradeoff between consistency and availability. The monitors are a
Paxos cluster and Ceph is a 100% consistent system.
-Greg


 Thanks  Regards
 Somnath

 -Original Message-
 From: Gregory Farnum [mailto:g...@gregs42.com]
 Sent: Thursday, March 26, 2015 3:29 PM
 To: Somnath Roy
 Cc: Lee Revell; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down

 On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote:
 Greg,
 Couple of dumb question may be.

 1. If you see , the clients are connecting fine with two monitors in the 
 cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 
 monitor  (which is I guess happening after making 2 nodes down) it is not 
 able to connect ?

 A quorum is a strict majority of the total membership. 2 monitors can form a 
 quorum just fine if there are either 2 or 3 total membership.
 (As long as those two agree on every action, it cannot be lost.)

 We don't *recommend* configuring systems with an even number of monitors, 
 because it increases the number of total possible failures without increasing 
 the number of failures that can be tolerated. (3 monitors need 2 in quorum
 and tolerate 1 failure; 4 need 3 and still tolerate only 1. Likewise for 5 and 6, 7 and 8, etc.)


 2. Also, my understanding is while IO is going on *no* monitor interaction 
 will be on that path, so, why the client io will be stopped because the 
 monitor quorum is not there ? If the min_size =1 is properly set it should 
 able to serve IO as long as 1 OSD (node) is up, isn't it ?

 Well, the remaining OSD won't be able to process IO because it's lost its 
 peers, and it can't reach any monitors to do updates or get new maps. 
 (Monitors which are not in quorum will not allow clients to
 connect.)
 The clients will eventually stop serving IO if they know they can't reach a 
 monitor, although I don't remember exactly how that's triggered.

 In this particular case, though, the client probably just tried to do an op 
 against the dead osd, realized it couldn't, and tried to fetch a map from the 
 monitors. When that failed it went into search mode, which is what the logs 
 are showing you.
 -Greg


 Thanks  Regards
 Somnath

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Gregory Farnum
 Sent: Thursday, March 26, 2015 2:40 PM
 To: Lee Revell
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs
 down

 On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote:
 On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote:

 Has the OSD actually been detected as down yet?


 I believe it has, however I can't directly check because ceph health
 starts to hang when I down the second node.

 Oh. You need to keep a quorum of your monitors running (just the monitor 
 processes, not of everything in the system) or nothing at all is going to 
 work. That's how we prevent split brain issues.



 You'll also need to set that min size on your existing pools (ceph
 osd pool set <pool> min_size 1 or similar) to change their
 behavior; the config option only takes effect for newly-created
 pools. (Thus the
 default.)


 I've done this, however the behavior is the same:

 $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do
 ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set
 pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to
 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6
 min_size to 1 set pool 7 min_size to 1

 $ ceph -w
 cluster db460aa2-5129-4aaa-8b2e-43eac727124e
  health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2
  monmap e3: 3 mons at
 {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/
 0 ,ceph-node-3=192.168.122.141:6789/0},
 election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2
  mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active}
  osdmap e362: 3 osds: 2 up, 2 in
   pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects
 25329 MB

Re: [ceph-users] All client writes block when 2 of 3 OSDs down

2015-03-26 Thread Gregory Farnum
On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy somnath@sandisk.com wrote:
 Greg,
 I think you got me wrong. I am not saying each monitor of a group of 3 should 
 be able to change the map. Here is the scenario.

 1. Cluster up and running with 3 mons (quorum of 3), all fine.

 2. One node (and mon) is down, quorum of 2 , still connecting.

 3. 2 nodes (and 2 mons) are down, should be quorum of 1 now and client should 
 still be able to connect. Isn't it ?

No. The monitors can't tell the difference between dead monitors, and
monitors they can't reach over the network. So they say there are
three monitors in my map; therefore it requires two to make any
change. That's the case regardless of whether all of them are
running, or only one.


 Cluster with single monitor is able to form a quorum and should be working 
 fine. So, why not in case of point 3 ?
 If this is the way Paxos works, should we say that in a cluster with say 3 
 monitors it should be able to tolerate only one mon failure ?

Yes, that is the case.


 Let me know if I am missing a point here.

 Thanks  Regards
 Somnath

 -Original Message-
 From: Gregory Farnum [mailto:g...@gregs42.com]
 Sent: Thursday, March 26, 2015 3:41 PM
 To: Somnath Roy
 Cc: Lee Revell; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down

 On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote:
 Got most portion of it, thanks !
 But, still not able to get when second node is down why with single monitor 
 in the cluster client is not able to connect ?
 1 monitor can form a quorum and should be sufficient for a cluster to run.

 The whole point of the monitor cluster is to ensure a globally consistent 
 view of the cluster state that will never be reversed by a different group of 
 up nodes. If one monitor (out of three) could make changes to the maps by 
 itself, then there's nothing to prevent all three monitors from staying up 
 but getting a net split, and then each issuing different versions of the 
 osdmaps to whichever clients or OSDs happen to be connected to them.

 If you want to get down into the math proofs and things then the Paxos papers 
 do all the proofs. Or you can look at the CAP theorem about the tradeoff 
 between consistency and availability. The monitors are a Paxos cluster and 
 Ceph is a 100% consistent system.
 -Greg


 Thanks  Regards
 Somnath

 -Original Message-
 From: Gregory Farnum [mailto:g...@gregs42.com]
 Sent: Thursday, March 26, 2015 3:29 PM
 To: Somnath Roy
 Cc: Lee Revell; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs
 down

 On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote:
 Greg,
 Couple of dumb question may be.

 1. If you see , the clients are connecting fine with two monitors in the 
 cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 
 monitor  (which is I guess happening after making 2 nodes down) it is not 
 able to connect ?

 A quorum is a strict majority of the total membership. 2 monitors can form a 
 quorum just fine if there are either 2 or 3 total membership.
 (As long as those two agree on every action, it cannot be lost.)

 We don't *recommend* configuring systems with an even number of
 monitors, because it increases the number of total possible failures
 without increasing the number of failures that can be tolerated. (3
 monitors need 2 in quorum and tolerate 1 failure; 4 need 3 in quorum and
 still tolerate only 1. Likewise for 5 and 6, 7 and 8, etc.)


 2. Also, my understanding is while IO is going on *no* monitor interaction 
 will be on that path, so, why the client io will be stopped because the 
 monitor quorum is not there ? If the min_size =1 is properly set it should 
 able to serve IO as long as 1 OSD (node) is up, isn't it ?

 Well, the remaining OSD won't be able to process IO because it's lost
 its peers, and it can't reach any monitors to do updates or get new
 maps. (Monitors which are not in quorum will not allow clients to
 connect.)
 The clients will eventually stop serving IO if they know they can't reach a 
 monitor, although I don't remember exactly how that's triggered.

 In this particular case, though, the client probably just tried to do an op 
 against the dead osd, realized it couldn't, and tried to fetch a map from 
 the monitors. When that failed it went into search mode, which is what the 
 logs are showing you.
 -Greg


 Thanks  Regards
 Somnath

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Gregory Farnum
 Sent: Thursday, March 26, 2015 2:40 PM
 To: Lee Revell
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs
 down

 On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote:
 On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote:

 Has the OSD actually been detected as down yet?


 I believe it has, however I can't directly

Re: [ceph-users] error creating image in rbd-erasure-pool

2015-03-24 Thread Gregory Farnum
On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney molo...@ohsu.edu wrote:

 Hi Loic and Markus,
 By the way, Inktank do not support snapshot of a pool with cache tiering :

* 
 https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf

 Hi,

 You seem to be talking about pool snapshots rather than RBD snapshots.  But 
 in the linked document it is not clear that there is a distinction:

 Can I use snapshots with a cache tier?
 Snapshots are not supported in conjunction with cache tiers.

 Can anyone clarify if this is just pool snapshots?

I think that was just a decision based on the newness and complexity
of the feature for product purposes. Snapshots against cache tiered
pools certainly should be fine in Giant/Hammer and we can't think of
any issues in Firefly off the tops of our heads.
-Greg
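
To spell out the distinction in commands (pool, image, and snapshot names here
are just placeholders): a pool snapshot is taken with

$ ceph osd pool mksnap mypool mysnap        # or: rados -p mypool mksnap mysnap

while an RBD self-managed snapshot is

$ rbd snap create mypool/myimage@mysnap

and a given pool can only ever use one of the two styles.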
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does crushtool --test --simulate do what cluster should do?

2015-03-24 Thread Gregory Farnum
On Tue, Mar 24, 2015 at 10:48 AM, Robert LeBlanc rob...@leblancnet.us wrote:
 I'm not sure why crushtool --test --simulate doesn't match what the
 cluster actually does, but the cluster seems to be executing the rules
 even though crushtool doesn't. Just kind of stinks that you have to
 test the rules on actual data.

 Should I create a ticket for this?

Yes please! I'm not too familiar with the crushtool internals but the
simulator code hasn't had too many eyeballs so it's hopefully not too
hard a bug to fix.
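
In the meantime, a sanity check that doesn't require writing data is to compare
the simulator against what the cluster itself computes (pool and object names
below are placeholders):

$ ceph osd getcrushmap -o map.bin
$ crushtool -d map.bin -o map.txt                     # confirm the installed rules match map.crush
$ crushtool -i map.bin --test --rule 0 --num-rep 3 --show-mappings
$ ceph osd map rbd some-object                        # the mapping the cluster would actually use

If ceph osd map stays within the intended root while crushtool wanders into
the other one, that points at the simulator rather than the ruleset.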


 On Mon, Mar 23, 2015 at 6:08 PM, Robert LeBlanc rob...@leblancnet.us wrote:
 I'm trying to create a CRUSH ruleset and I'm using crushtool to test
 the rules, but it doesn't seem to be mapping things correctly. I have two
 roots, one for spindles and another for SSD. I have two rules, one for
 each root. The output of crushtool on rule 0 shows objects being
 mapped to SSD OSDs when it should only be choosing spindles.

 I'm pretty sure I'm doing something wrong. I've tested the map on .93 and 
 .80.8.

 The map is at http://pastebin.com/BjmuASX0

 when running

 crushtool -i map.crush --test --num-rep 3 --rule 0 --simulate --show-mappings

 I'm getting mappings to OSDs > 39, which are SSDs. The same happens when
 I run the SSD rule, I get OSDs from both roots. It is as if crushtool
 is not selecting the correct root. In fact both rules result in the
 same mapping:

 RNG rule 0 x 0 [0,38,23]
 RNG rule 0 x 1 [10,25,1]
 RNG rule 0 x 2 [11,40,0]
 RNG rule 0 x 3 [5,30,26]
 RNG rule 0 x 4 [44,30,10]
 RNG rule 0 x 5 [8,26,16]
 RNG rule 0 x 6 [24,5,36]
 RNG rule 0 x 7 [38,10,9]
 RNG rule 0 x 8 [39,9,23]
 RNG rule 0 x 9 [12,3,24]
 RNG rule 0 x 10 [18,6,41]
 ...

 RNG rule 1 x 0 [0,38,23]
 RNG rule 1 x 1 [10,25,1]
 RNG rule 1 x 2 [11,40,0]
 RNG rule 1 x 3 [5,30,26]
 RNG rule 1 x 4 [44,30,10]
 RNG rule 1 x 5 [8,26,16]
 RNG rule 1 x 6 [24,5,36]
 RNG rule 1 x 7 [38,10,9]
 RNG rule 1 x 8 [39,9,23]
 RNG rule 1 x 9 [12,3,24]
 RNG rule 1 x 10 [18,6,41]
 ...


 Thanks,
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS

2015-03-26 Thread Gregory Farnum
On Wed, Mar 25, 2015 at 8:10 PM, Ridwan Rashid Noel ridwan...@gmail.com wrote:
 Hi Greg,

 Thank you for your response. I have understood that I should be starting
 only the mapred daemons when using cephFS instead of HDFS. I have fixed that
 and am trying to run the hadoop wordcount job with this command:

 bin/hadoop jar hadoop*examples*.jar wordcount /tmp/wc-input /tmp/wc-output

 but I am getting this error

 15/03/26 02:54:35 INFO util.NativeCodeLoader: Loaded the native-hadoop
 library
 15/03/26 02:54:35 INFO input.FileInputFormat: Total input paths to process :
 1
 15/03/26 02:54:35 WARN snappy.LoadSnappy: Snappy native library not loaded
 15/03/26 02:54:35 INFO mapred.JobClient: Running job: job_201503260253_0001
 15/03/26 02:54:36 INFO mapred.JobClient:  map 0% reduce 0%
 15/03/26 02:54:36 INFO mapred.JobClient: Task Id :
 attempt_201503260253_0001_m_21_0, Status : FAILED
 Error initializing attempt_201503260253_0001_m_21_0:
 java.io.FileNotFoundException: File
 file:/tmp/hadoop-ceph/mapred/system/job_201503260253_0001/jobToken does not
 exist.
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
 at
 org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:4445)
 at
 org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1272)
 at
 org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1213)
 at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2568)
 at java.lang.Thread.run(Thread.java:745)

I'm not an expert at setting up Hadoop, but these errors are coming
out of the RawLocalFileSystem, which I think means that the worker node
is trying to use a local FS instead of Ceph. Did you set up each node
to access Ceph? Have you set up and used Hadoop previously?
-Greg
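
A quick way to check whether a worker node is really talking to CephFS (and
not falling back to the local filesystem) is to exercise the configured
default FS directly from that node, e.g. (path names are just examples):

$ bin/hadoop fs -ls /
$ bin/hadoop fs -put /etc/hosts /cephtest && bin/hadoop fs -cat /cephtest

If those operate on the local disk or complain about missing Ceph classes,
then that node's core-site.xml, the CephFS Hadoop plugin jar, or the libcephfs
Java bindings aren't being picked up.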


 .

 I have used the core-site.xml configurations as mentioned in
 http://ceph.com/docs/master/cephfs/hadoop/
 Please tell me how can this problem be solved?

 Regards,

 Ridwan Rashid Noel

 Doctoral Student,
 Department of Computer Science,
 University of Texas at San Antonio

 Contact# 210-773-9966

 On Fri, Mar 20, 2015 at 4:04 PM, Gregory Farnum g...@gregs42.com wrote:

 On Fri, Mar 20, 2015 at 1:05 PM, Ridwan Rashid ridwan...@gmail.com
 wrote:
  Gregory Farnum greg@... writes:
 
 
  On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan064@... wrote:
   Hi,
  
   I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop
   with
   cephFS. I have installed hadoop-1.1.1 in the nodes and changed the
   conf/core-site.xml file according to the ceph documentation
   http://ceph.com/docs/master/cephfs/hadoop/ but after changing the
   file the
   namenode is not starting (namenode can be formatted) but the other
   services(datanode, jobtracker, tasktracker) are running in hadoop.
  
   The default hadoop works fine but when I change the core-site.xml
   file as
   above I get the following bindException as can be seen from the
   namenode
  log:
  
  
   2015-03-19 01:37:31,436 ERROR
   org.apache.hadoop.hdfs.server.namenode.NameNode:
   java.net.BindException:
   Problem binding to node1/10.242.144.225:6789 : Cannot assign
   requested
  address
  
  
   I have one monitor for the ceph cluster (node1/10.242.144.225) and I
   included in the core-site.xml file ceph://10.242.144.225:6789 as the
   value
   of fs.default.name. The 6789 port is the default port being used by
   the
   monitor node of ceph, so that may be the reason for the bindException
   but
   the ceph documentation mentions that it should be included like this
   in the
   core-site.xml file. It would be really helpful to get some pointers
   to where
   I am doing wrong in the setup.
 
  I'm a bit confused. The NameNode is only used by HDFS, and so
  shouldn't be running at all if you're using CephFS. Nor do I have any
  idea why you've changed anything in a way that tells the NameNode to
  bind to the monitor's IP address; none of the instructions that I see
  can do that, and they certainly shouldn't be.
  -Greg
 
 
  Hi Greg,
 
  I want to run a hadoop job (e.g. terasort) and want to use cephFS
  instead of
  HDFS. In Using Hadoop with cephFS documentation in
  http://ceph.com/docs/master/cephfs/hadoop/ if you look into the Hadoop
  configuration section, the first property fs.default.name has to be set
  as
  the ceph URI and in the notes it's mentioned as ceph://[monaddr:port]/.
  My
  core-site.xml of hadoop conf looks like this
 
  <configuration>

  <property>
  <name>fs.default.name</name>
  <value>ceph://10.242.144.225:6789</value>
  </property>

 Yeah, that all makes sense. But I don't understand why or how you're
 starting up a NameNode at all, nor what config values it's drawing
 from to try and bind to that port. The NameNode is the problem because
 it shouldn't even be invoked.
 -Greg
