[ceph-users] Replace corrupt journal
Hi,

I have a problem starting a couple of OSDs because their journals are corrupt. Is there any way to replace the journal while keeping the rest of the OSD intact?

-1> 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt
 0> 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7fb32df86900 time 2015-01-11 16:02:54.475276
os/FileJournal.cc: 1693: FAILED assert(0)

I ended up in this situation when osd.9 on host orange went down, and then I had a power failure on the host purple which left two of my journals corrupt.

-3   6   host purple
 4   1       osd.4   up     1
 5   1       osd.5   down   0
 7   2       osd.7   down   0
 6   2       osd.6   up     1
-4   6   host orange
 8   1       osd.8   up     1
 9   1       osd.9   down   0

The filesystem was not in use by users, but it was replicating when the host went down. I figure I still have the data on the OSD disks; they are still mountable and the XFS filesystem on them seems to be intact.

Thanks,
Claes

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
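For context, the commonly suggested way out of this is to give the OSD a fresh, empty journal and accept that the writes stuck in the corrupt one are discarded -- only reasonable if replication can repair the OSD afterwards. A sketch, assuming osd.5 and a journal at the default FileStore path (all paths here are illustrative, not taken from the message):

```
# Stop the OSD if it is still trying to start
service ceph stop osd.5

# Move the corrupt journal aside (path is illustrative)
mv /var/lib/ceph/osd/ceph-5/journal /var/lib/ceph/osd/ceph-5/journal.corrupt

# Create a fresh, empty journal for this OSD and start it again
ceph-osd -i 5 --mkjournal
service ceph start osd.5

# Then let scrub/repair reconcile any writes lost with the old journal
ceph pg repair <pgid>    # per inconsistent PG, as needed
```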
[ceph-users] Ceph MeetUp Berlin
Hi,

the next MeetUp in Berlin takes place on January 26 at 18:00 CET. Our host is Deutsche Telekom; they will give a short presentation about their OpenStack / Ceph based production system.

Please RSVP at http://www.meetup.com/Ceph-Berlin/events/218939774/

Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin
Re: [ceph-users] NUMA and ceph ... zone_reclaim_mode
(resending to list)

Hi Kyle,

I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that is used for Cinder RBD volumes and doesn't suffer from the freezing OSD problem:

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1,10,1000,1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.:

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

  zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all on the ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO:

  On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default.
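The change described above boils down to a one-line sysctl. A minimal sketch (the sysctl name is standard Linux; whether your distro persists settings via /etc/sysctl.conf or a file under /etc/sysctl.d/, and the file name used here, are assumptions):

```
# Check the current setting (non-zero means NUMA zone reclaim is enabled)
cat /proc/sys/vm/zone_reclaim_mode

# Disable it on the running system
sysctl -w vm.zone_reclaim_mode=0

# Persist across reboots (path and file name may vary by distro)
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.d/99-ceph-numa.conf
```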
Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote:

It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running out of memory due to uneven allocation, it is possible for the zone to enter reclaim mode when threads/processes are scheduled on a core in that zone and those processes request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy the memory allocation for those processes, it needs to page out some of the contents of the contentious zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues: set the vm.zone_reclaim_mode sysctl to 0, along with prefixing ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in /etc/default/ceph and modifying the ceph-osd.conf upstart script, along with adding a depend to the
[ceph-users] error adding OSD to crushmap
Hi all,

I've been trying to add a few new OSDs, and as I manage everything with puppet, I was adding them manually via the CLI. At one point this adds the OSD to the crush map using:

# ceph osd crush add 6 0.0 root=default

but I get:

Error ENOENT: osd.6 does not exist. create it before updating the crush map

If I read correctly, this should be the right command to add the OSD to the crush map... is this a bug? I'm running the latest firefly, 0.80.7.

thanks

PS: For now I just edited the crushmap directly, but it would make it a lot easier to do it via the CLI commands...
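For reference, the error message is literal: `ceph osd crush add` only places an OSD id that the cluster already knows about, so the id has to be registered first. A sketch of the usual manual order (the id and weight are taken from the message above; whether your provisioning already ran the create step is the open question):

```
# Allocate/register the next osd id with the cluster first;
# the command prints the id it assigned (expected to be 6 here)
ceph osd create

# Only then can that osd be placed in the CRUSH map
ceph osd crush add 6 0.0 root=default
```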
[ceph-users] NUMA zone_reclaim_mode
(apologies if you receive this more than once... apparently I cannot reply to a 1-year-old message on the list)

Dear all,

I'd like to +10 this old proposal of Kyle's. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that is used for Cinder RBD volumes and doesn't suffer from the freezing OSD problem:

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1,10,1000,1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.:

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

  zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all on the ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO:

  On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default.

Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote:

It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running out of memory due to uneven allocation, it is possible for the zone to enter reclaim mode when threads/processes are scheduled on a core in that zone and those processes request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy the memory allocation for those processes, it needs to page out some of the contents of the contentious zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues: set the vm.zone_reclaim_mode sysctl to 0, along with prefixing ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, 12 Jan 2015, Dan Van Der Ster wrote:

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean, since the recommendation probably doesn't apply when using a backend that doesn't use a file system...

Thanks!
sage
Re: [ceph-users] cephfs modification time
What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc.) Can you provide more details on exactly what the program is doing on which nodes?
-Greg

On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote:

The first 3 stat commands show blocks and size changing, but not the times. After a touch it changes and tail works. I saw some cephfs freezes related to it; it came back after touching the files.

coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 148564    Blocks: 291    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:00.100582619 +0000
Modify: 2015-01-10 01:13:00.100582619 +0000
Change: 2015-01-10 01:13:00.0 +0000
 Birth: -

coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 152633    Blocks: 299    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:00.100582619 +0000
Modify: 2015-01-10 01:13:00.100582619 +0000
Change: 2015-01-10 01:13:00.0 +0000
 Birth: -

coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 155763    Blocks: 305    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:00.100582619 +0000
Modify: 2015-01-10 01:13:00.100582619 +0000
Change: 2015-01-10 01:13:00.0 +0000
 Birth: -

coreos2 logs # touch deis-router.log
coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 155763    Blocks: 305    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:46.961858103 +0000
Modify: 2015-01-10 01:13:46.961858103 +0000
Change: 2015-01-10 01:13:46.0 +0000
 Birth: -

On Fri, Jan 9, 2015 at 11:11 PM, Lorieri lori...@gmail.com wrote:

Hi, I have a program that tails a file, and this file is created on another machine. Some tail programs do not work because the modification time is not updated on the remote machines. I've found this old thread:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/11001

It mentions the problem and suggests an ntp sync. I tried to re-sync ntp and restart the ceph cluster, but the issue persists. Do you know if it is possible to avoid this behavior?

thanks
-lorieri
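The dependency being described is easy to see: polling tail implementations re-read a file only when stat() reports growth, which is exactly what breaks when a remote cephfs client sees stale size/mtime. A minimal sketch of that polling logic, on a local filesystem purely for illustration:

```python
import os
import tempfile

def poll_for_growth(path, last_size):
    """Return (new_size, new_bytes). The file is re-read only when stat()
    shows growth -- so a client that sees stale stat() data emits nothing,
    which is the tail behavior described in the message above."""
    st = os.stat(path)
    if st.st_size <= last_size:
        return last_size, b""
    with open(path, "rb") as f:
        f.seek(last_size)
        return st.st_size, f.read()

# demo: append to a file and poll it twice
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line1\n")
    name = tmp.name

size, new = poll_for_growth(name, 0)
print(new)  # b'line1\n'

with open(name, "ab") as f:
    f.write(b"line2\n")

size, new = poll_for_growth(name, size)
print(new)  # b'line2\n'
os.remove(name)
```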
[ceph-users] the performance issue for cache pool
Hi everyone,

I used writeback mode for a cache pool:

ceph osd tier add sas ssd
ceph osd tier cache-mode ssd writeback
ceph osd tier set-overlay sas ssd

and I also set the dirty ratio and full ratio:

ceph osd pool set ssd cache_target_dirty_ratio .4
ceph osd pool set ssd cache_target_full_ratio .8

The capacity of the ssd cache pool is 4T. I used fio to test performance:

fio -filename=/dev/rbd0 -direct=1 -iodepth 32 -thread -rw=randwrite -ioengine=libaio -bs=16M -size=2000G -group_reporting -name=mytest

At the beginning the performance is very good, but after half an hour, when the hot cache pool begins flushing dirty objects, the rados performance becomes unstable, swinging from 87851 kB/s to 860 MB/s. Are there any tuning parameters to get more stable performance? Thanks.

2014-12-23 22:46:24.844730 mon.0 [INF] pgmap v24101: 6144 pgs: 6144 active+clean; 1246 GB data, 4012 GB used, 45109 GB / 49121 GB avail; 680 MB/s wr, 1007 op/s
2014-12-23 22:46:27.851431 mon.0 [INF] pgmap v24102: 6144 pgs: 6144 active+clean; 1246 GB data, 4012 GB used, 45109 GB / 49121 GB avail; 161 MB/s wr, 299 op/s
2014-12-23 22:46:28.883866 mon.0 [INF] pgmap v24103: 6144 pgs: 6144 active+clean; 1247 GB data, 4015 GB used, 45106 GB / 49121 GB avail; 308 MB/s wr, 1065 op/s
2014-12-23 22:46:29.885914 mon.0 [INF] pgmap v24104: 6144 pgs: 6144 active+clean; 1247 GB data, 4016 GB used, 45105 GB / 49121 GB avail; 701 MB/s wr, 1621 op/s
2014-12-23 22:46:32.842955 mon.0 [INF] pgmap v24105: 6144 pgs: 6144 active+clean; 1247 GB data, 4016 GB used, 45105 GB / 49121 GB avail; 116 MB/s wr, 160 op/s
2014-12-23 22:46:33.863964 mon.0 [INF] pgmap v24106: 6144 pgs: 6144 active+clean; 1248 GB data, 4021 GB used, 45100 GB / 49121 GB avail; 344 MB/s wr, 923 op/s
2014-12-23 22:46:34.861011 mon.0 [INF] pgmap v24107: 6144 pgs: 6144 active+clean; 1248 GB data, 4021 GB used, 45100 GB / 49121 GB avail; 706 MB/s wr, 1564 op/s
2014-12-23 22:46:38.176885 mon.0 [INF] pgmap v24108: 6144 pgs: 6144 active+clean; 1249 GB data, 4024 GB used, 45097 GB / 49121 GB avail; 222 MB/s wr, 938 op/s
2014-12-23 22:46:39.177233 mon.0 [INF] pgmap v24109: 6144 pgs: 6144 active+clean; 1250 GB data, 4026 GB used, 45095 GB / 49121 GB avail; 427 MB/s wr, 1292 op/s
2014-12-23 22:46:42.842279 mon.0 [INF] pgmap v24110: 6144 pgs: 6144 active+clean; 1250 GB data, 4026 GB used, 45095 GB / 49121 GB avail; 320 MB/s wr, 570 op/s
2014-12-23 22:46:43.872017 mon.0 [INF] pgmap v24111: 6144 pgs: 6144 active+clean; 1251 GB data, 4030 GB used, 45090 GB / 49121 GB avail; 405 MB/s wr, 992 op/s
2014-12-23 22:46:44.862873 mon.0 [INF] pgmap v24112: 6144 pgs: 6144 active+clean; 1251 GB data, 4030 GB used, 45090 GB / 49121 GB avail; 729 MB/s wr, 1755 op/s
2014-12-23 22:46:47.847813 mon.0 [INF] pgmap v24113: 6144 pgs: 6144 active+clean; 1251 GB data, 4031 GB used, 45090 GB / 49121 GB avail; 2053 kB/s wr, 135 op/s
2014-12-23 22:46:48.857285 mon.0 [INF] pgmap v24114: 6144 pgs: 6144 active+clean; 1252 GB data, 4033 GB used, 45087 GB / 49121 GB avail; 272 MB/s wr, 433 op/s
2014-12-23 22:46:49.871775 mon.0 [INF] pgmap v24115: 6144 pgs: 6144 active+clean; 1252 GB data, 4034 GB used, 45087 GB / 49121 GB avail; 535 MB/s wr, 586 op/s
2014-12-23 22:46:52.842098 mon.0 [INF] pgmap v24116: 6144 pgs: 6144 active+clean; 1252 GB data, 4033 GB used, 45088 GB / 49121 GB avail; 3074 kB/s wr, 113 op/s
2014-12-23 22:46:53.845398 mon.0 [INF] pgmap v24117: 6144 pgs: 6144 active+clean; 1254 GB data, 4037 GB used, 45084 GB / 49121 GB avail; 342 MB/s wr, 571 op/s
2014-12-23 22:46:57.844137 mon.0 [INF] pgmap v24118: 6144 pgs: 6144 active+clean; 1254 GB data, 4037 GB used, 45084 GB / 49121 GB avail; 302 MB/s wr, 577 op/s
2014-12-23 22:46:58.848028 mon.0 [INF] pgmap v24119: 6144 pgs: 6144 active+clean; 1255 GB data, 4039 GB used, 45082 GB / 49121 GB avail; 319 MB/s wr, 897 op/s
2014-12-23 22:47:02.844724 mon.0 [INF] pgmap v24120: 6144 pgs: 6144 active+clean; 1255 GB data, 4039 GB used, 45082 GB / 49121 GB avail; 327 MB/s wr, 856 op/s
2014-12-23 22:47:03.850795 mon.0 [INF] pgmap v24121: 6144 pgs: 6144 active+clean; 1256 GB data, 4043 GB used, 45078 GB / 49121 GB avail; 297 MB/s wr, 887 op/s
2014-12-23 22:47:08.169046 mon.0 [INF] pgmap v24122: 6144 pgs: 6144 active+clean; 1256 GB data, 4045 GB used, 45076 GB / 49121 GB avail; 318 MB/s wr, 830 op/s
2014-12-23 22:47:09.169302 mon.0 [INF] pgmap v24123: 6144 pgs: 6144 active+clean; 1257 GB data, 4046 GB used, 45075 GB / 49121 GB avail; 133 MB/s wr, 257 op/s
2014-12-23 22:47:12.844073 mon.0 [INF] pgmap v24124: 6144 pgs: 6144 active+clean; 1257 GB data, 4046 GB used, 45075 GB / 49121 GB avail; 65702 kB/s wr, 124 op/s
2014-12-23 22:47:13.845286 mon.0 [INF] pgmap v24125: 6144 pgs: 6144 active+clean; 1257 GB data, 4047 GB used, 45074 GB / 49121 GB avail; 142 MB/s wr, 284 op/s
2014-12-23 22:47:14.846753 mon.0 [INF] pgmap v24126: 6144 pgs: 6144 active+clean; 1257 GB data, 4047 GB used, 45074 GB / 49121 GB avail; 461
[ceph-users] unsubscribe
unsubscribe

Regards,
-don-
[ceph-users] Problem with Rados gateway
Scenario: OpenStack Juno RDO on CentOS 7. Ceph version: Giant.

On CentOS 7 the old fastcgi is no longer available, but there is mod_fcgid. The Apache VH is the following:

<VirtualHost *:8080>
    ServerName rdo-ctrl01
    DocumentRoot /var/www/radosgw
    RewriteEngine On
    RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
    <Directory /var/www/radosgw>
        Options +ExecCGI
        AllowOverride All
        SetHandler fcgid-script
        Order allow,deny
        Allow from all
        AuthBasicAuthoritative Off
    </Directory>
    AllowEncodedSlashes On
    ErrorLog /var/log/httpd/error.log
    CustomLog /var/log/httpd/access.log combined
    ServerSignature Off
</VirtualHost>

In /var/www/radosgw there is the CGI file s3gw.fcgi:

#!/bin/sh
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway -d --debug-rgw 20 --debug-ms 1

For the configuration I've followed this documentation: http://docs.ceph.com/docs/next/radosgw/config/

When I try to access the object storage I get the following errors:

1) Apache VH error:

[Wed Jan 07 13:15:22.029411 2015] [fcgid:info] [pid 2051] mod_fcgid: server rdo-ctrl01:/var/www/radosgw/s3gw.fcgi(28527) started
2015-01-07 13:15:22.046644 7ff16e240880 0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process radosgw, pid 28527
2015-01-07 13:15:22.053673 7ff16e240880 1 -- :/0 messenger.start
2015-01-07 13:15:22.054783 7ff16e240880 1 -- :/1028527 -- 163.162.90.120:6789/0 -- auth(proto 0 40 bytes epoch 0) v1 -- ?+0 0x11d9100 con 0x11a0870
2015-01-07 13:15:22.055339 7ff16e238700 1 -- 163.162.90.120:0/1028527 learned my addr 163.162.90.120:0/1028527
2015-01-07 13:15:22.056425 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 1 mon_map magic: 0 v1 200+0+0 (3839442293 0 0) 0x7ff148000ab0 con 0x11a0870
2015-01-07 13:15:22.056547 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 2 auth_reply(proto 2 0 (0) Success) v1 33+0+0 (3991100068 0 0) 0x7ff148000f70 con 0x11a0870
2015-01-07 13:15:22.056900 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7ff14c0012e0 con 0x11a0870
2015-01-07 13:15:22.057505 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 3 auth_reply(proto 2 0 (0) Success) v1 222+0+0 (1145796146 0 0) 0x7ff148000f70 con 0x11a0870
2015-01-07 13:15:22.057768 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- ?+0 0x7ff14c001ca0 con 0x11a0870
2015-01-07 13:15:22.058496 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 4 auth_reply(proto 2 0 (0) Success) v1 425+0+0 (2903986998 0 0) 0x7ff148001200 con 0x11a0870
2015-01-07 13:15:22.058694 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x11d94c0 con 0x11a0870
2015-01-07 13:15:22.058843 7ff16e240880 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x11d91d0 con 0x11a0870
2015-01-07 13:15:22.058934 7ff16e240880 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x11d9ab0 con 0x11a0870
2015-01-07 13:15:22.059214 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 5 mon_map magic: 0 v1 200+0+0 (3839442293 0 0) 0x7ff148001130 con 0x11a0870
2015-01-07 13:15:22.059140 7ff1567fc700 2 RGWDataChangesLog::ChangesRenewThread: start
2015-01-07 13:15:22.059737 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 6 mon_subscribe_ack(300s) v1 20+0+0 (1877860257 0 0) 0x7ff148001410 con 0x11a0870
2015-01-07 13:15:22.059869 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 7 osd_map(52..52 src has 1..52) v3 5987+0+0 (3066791464 0 0) 0x7ff148002d50 con 0x11a0870
2015-01-07 13:15:22.060250 7ff16e240880 20 get_obj_state: rctx=0x119c2f0 obj=.rgw.root:default.region state=0x119dba8 s->prefetch_data=0
2015-01-07 13:15:22.060302 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 8 mon_subscribe_ack(300s) v1 20+0+0 (1877860257 0 0) 0x7ff148001130 con 0x11a0870
2015-01-07 13:15:22.060325 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 9 osd_map(52..52 src has 1..52) v3 5987+0+0 (3066791464 0 0) 0x7ff1480046f0 con 0x11a0870
2015-01-07 13:15:22.060333 7ff16e240880 10 cache get: name=.rgw.root+default.region : miss
2015-01-07 13:15:22.060342 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 10 mon_subscribe_ack(300s) v1 20+0+0 (1877860257 0 0) 0x7ff148004bb0 con 0x11a0870
2015-01-07 13:15:22.060444 7ff16e240880 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=53}) v2 -- ?+0 0x119eaf0 con 0x11a0870
2015-01-07 13:15:22.060805
Re: [ceph-users] rbd directory listing performance issues
Hi,

I am just wondering if anyone has any thoughts on the questions below... I would like to order some additional hardware ASAP, and the order that I place may change depending on the feedback that I receive.

Thanks again,
Shain

Sent from my iPhone

On Jan 9, 2015, at 2:45 PM, Shain Miley smi...@npr.org wrote:

Although it seems like having a regularly scheduled cron job to do a recursive directory listing may be OK for us as a bit of a workaround, I am still in the process of trying to improve performance. A few other questions have come up as a result.

a) I am in the process of looking at specs for a new rbd 'headnode' that will be used to mount our 100TB rbd image. At some point in the future we may look into the performance and multi-client access that cephfs could offer... is there any reason that I would not be able to use this new server as both an rbd client and an MDS server (assuming the hardware is good enough)? I know that some cluster functions should not and cannot be mixed on the same server... is this by any chance one of them?

b) Currently the 100TB rbd image is acting as one large repository for our archive, and this will only grow over time. I understand that ceph is pool based; however, I am wondering if I would see better per-image performance if, for example, instead of having 1 x 100TB rbd image I had 4 x 25TB rbd images (since we really could split these up based on our internal groups).

c) Would adding a few ssd drives (in the right quantity) to each node help out with reads as well as writes?

d) I am a bit confused about how to enable the rbd cache option on the client... is this a change that only needs to be made to the ceph.conf file on the rbd kernel client server, or do the mds and osd servers need their ceph.conf files modified as well and their services restarted?

Other options that I might be looking into going forward are moving some of this data (the data actually needed by our php apps) to rgw, although that option adds some more complexity and unfamiliarity for our users.

Thanks again for all the help so far.

Shain

On 01/07/2015 03:40 PM, Shain Miley wrote:

Just to follow up on this thread: the main reason that the rbd directory listing latency was an issue for us was that we were seeing a large amount of IO delay in a PHP app that reads from that rbd image. It occurred to me (based on Robert's cache_dir suggestion below) that maybe doing a recursive find or a recursive directory listing inside the one folder in question might speed things up. After doing the recursive find, the directory listing seems much faster and the responsiveness of the PHP app has increased as well. Hopefully nothing else will need to be done here; however, worst case, a daily or weekly cron job that traverses the directory tree in that folder might be all we need.

Thanks again for all the help.

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Shain Miley [smi...@npr.org]
Sent: Tuesday, January 06, 2015 8:16 PM
To: Christian Balzer; ceph-us...@ceph.com
Subject: Re: [ceph-users] rbd directory listing performance issues

Christian,

Each of the OSD server nodes is running on Dell R-720xds with 64 GB of RAM. We have 107 OSDs, so I have not checked all of them; however, the ones I have checked with xfs_db have shown anywhere from 1% to 4% fragmentation.

I'll try to upgrade the client server to 32 or 64 GB of RAM at some point soon; however, at this point all the tuning that I have done has not yielded all that much in terms of results. It may be a simple fact that I need to look into adding some SSDs, and that the overall bottleneck here is the 4TB 7200 rpm disks we are using.

In general, when looking at the graphs in Calamari, we see around 20ms latency (await) for our OSDs; however, there are lots of times when we see (via the graphs) spikes of 250ms to 400ms as well.

Thanks again,
Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: Christian Balzer [ch...@gol.com]
Sent: Tuesday, January 06, 2015 7:34 PM
To: ceph-us...@ceph.com
Cc: Shain Miley
Subject: Re: [ceph-users] rbd directory listing performance issues

Hello,

On Tue, 6 Jan 2015 15:29:50 +0000 Shain Miley wrote:

Hello, we currently have a 12 node (3 monitor + 9 OSD) ceph cluster, made up of 107 x 4TB drives formatted with xfs. The cluster is running ceph version 0.80.7.

I assume journals are on the same HDD then. How much memory per node?

[snip]

A while back I created an 80 TB rbd image to be used as an archive repository for some of our audio and video
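On the rbd cache question above: rbd cache is a client-side librbd setting, so it belongs in the [client] section of ceph.conf on the machine opening the image; the mds/osd ceph.conf files are not involved. Note also that the kernel rbd client does not use librbd's cache at all (krbd goes through the page cache), so this setting only matters for librbd consumers such as qemu. A sketch, where the option names are the standard ones and the sizes are illustrative assumptions:

```
[client]
    rbd cache = true
    rbd cache size = 67108864                  # 64 MB, illustrative
    rbd cache max dirty = 50331648             # must be smaller than rbd cache size
    rbd cache writethrough until flush = true  # safe default until the guest flushes
```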
Re: [ceph-users] ceph on peta scale
On Mon, Jan 12, 2015 at 3:55 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:

Thanks Greg. No, I am more into a large-scale RADOS system, not the filesystem. However, for geographically distributed datacentres, especially when the network fluctuates, how do we handle that? From what I've read, it seems Ceph needs a big network pipe.

Ceph isn't really suited for WAN-style distribution. Some users have high-enough and consistent-enough bandwidth (with low enough latency) to do it, but otherwise you probably want to use Ceph within the data centers and layer something else on top of it.
-Greg

/Zee

On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote:

On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote: I just finished configuring ceph up to 100 TB with openstack... Since we are also using Lustre on our HPC machines, I was just wondering what the bottleneck is for ceph at peta scale, like Lustre. Any ideas? Or has someone tried it?

If you're talking about people building a petabyte Ceph system, there are *many* who run clusters of that size. If you're talking about the Ceph filesystem as a replacement for Lustre at that scale, the concern is less about the raw amount of data and more about the resiliency of the current code base at that size... but if you want to try it out and tell us what problems you run into, we will love you forever. ;)

(The scalable file system use case is what actually spawned the Ceph project, so in theory there shouldn't be any serious scaling bottlenecks. In practice it will depend on what kind of metadata throughput you need, because the multi-MDS stuff is improving but still less stable.)
-Greg

--
Regards
Zeeshan Ali Shah
System Administrator - PDC HPC
PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah
[ceph-users] reset osd perf counters
Is there a way to 'reset' the osd perf counters? The numbers for osd 73 through osd 83 look really high compared to the rest of the numbers I see here. I was wondering if I could clear the counters out, so that I have a fresh set of data to work with.

root@cephmount1:/var/log/samba# ceph osd perf
osdid  fs_commit_latency(ms)  fs_apply_latency(ms)
  0    0    45
  1    0    14
  2    0    47
  3    0    25
  4    1    44
  5    1     2
  6    1     2
  7    0    39
  8    0    32
  9    0    34
 10    2   186
 11    0    68
 12    1     1
 13    0    34
 14    0     1
 15    2    37
 16    0    23
 17    0    28
 18    0    26
 19    0    22
 20    0     2
 21    2    24
 22    0    33
 23    0     1
 24    3    98
 25    2    70
 26    0     1
 27    3    99
 28    0     2
 29    2   101
 30    2    72
 31    2    81
 32    3   112
 33    3    94
 34    4   152
 35    0    56
 36    0     2
 37    2    58
 38    0     1
 39    0     3
 40    0     2
 41    0     2
 42    1     1
 43    0     2
 44    1    44
 45    0     2
 46    0     1
 47    3    85
 48    0     1
 49    2    75
 50    4   398
 51    3   115
 52    0     1
 53    2    47
 54    6   290
 55    5   153
 56    7   453
 57    2    66
 58    1     1
 59    5   196
 60    0     0
 61    0    93
 62    0     9
 63    0     1
 64    0     1
 65    0     4
 66    0     1
 67    0    18
 68    0    16
 69    0    81
 70    0    70
 71    0     0
 72    0     1
 73   74  1217
 74    0     1
 75   64  1238
 76   92  1248
 77    0     1
 78    0     1
 79  109  1333
 80   68  1451
 81   66  1192
 82   95  1215
 83   81  1331
 84    3    56
 85    3    65
 86    0     1
 87    3    55
 88    4    42
 89    3    59
 90    4    52
 91    2    34
 92    0    17
 93    0     1
 94    0
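Independent of resetting the counters, a quick mechanical way to flag the outliers in output like this — a minimal sketch, not any existing tool; the rows are assumed to already be parsed into (osd, commit_ms, apply_ms) tuples:

```python
# Flag OSDs whose fs_apply_latency is far above the cluster median.
from statistics import median

def outliers(rows, factor=10, floor_ms=100):
    """Return OSD ids whose apply latency exceeds max(floor_ms, factor * median)."""
    med = median(apply_ms for _osd, _commit, apply_ms in rows)
    cutoff = max(floor_ms, factor * med)
    return [osd for osd, _commit, apply_ms in rows if apply_ms >= cutoff]

# A few rows from the table above:
rows = [(72, 0, 1), (73, 74, 1217), (74, 0, 1), (75, 64, 1238), (84, 3, 56)]
print(outliers(rows))  # -> [73, 75]
```

On this sample the median apply latency is 56 ms, so the cutoff is 560 ms and osd.73 and osd.75 are flagged, matching the eyeball reading of the full table.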
Re: [ceph-users] ceph on peta scale
Thanks Greg. No, I am more interested in a large-scale RADOS system, not the filesystem. However, for geographically distributed datacentres, especially when the network fluctuates, how do we handle that? From what I have read, it seems Ceph needs a big network pipe. /Zee On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote: [quoted reply snipped -- identical to the message earlier in this thread] -- Regards Zeeshan Ali Shah System Administrator - PDC HPC PhD researcher (IT security) Kungliga Tekniska Hogskolan +46 8 790 9115 http://www.pdc.kth.se/members/zashah
[ceph-users] SSD Journal Best Practice
Hi everyone: I plan to use an SSD journal to improve performance. I have one 1.2 TB SSD disk per server. What is the best practice for the SSD journal? There are three ways to deploy the SSD journal:
1. All OSDs share the same SSD partition: ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd
2. Each OSD uses its own SSD partition: ceph-deploy osd create ceph-node:sdb:/dev/ssd1 ceph-node:sdc:/dev/ssd2
3. Each OSD uses a file for the journal, with the file on the SSD disk: ceph-deploy osd create ceph-node:sdb:/mnt/ssd/ssd1 ceph-node:sdc:/mnt/ssd/ssd2
Any suggestions? Thanks.
[ceph-users] How to get ceph-extras packages for centos7
Hi experts, could some of you guide me on how to get the ceph-extras packages for CentOS 7? I am trying to install Giant on CentOS 7 manually; however, the latest extras packages I can find in the repository are only for CentOS 6.4. BTW, is QEMU aware of Giant? Should I get a build dedicated to Giant? Thanks in advance. Thanks -- Ray Shi
Re: [ceph-users] CRUSH question - failing to rebalance after failure test
Hi, [redirecting back to list] Oh, it could be that... can you include the output from 'ceph osd tree'? That's a more concise view that shows up/down, weight, and in/out. Thanks! sage

root@cepharm17:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      0.52    root default
-21     0.16            chassis board0
-2      0.032                   host cepharm11
0       0.032                           osd.0   up      1
-3      0.032                   host cepharm12
1       0.032                           osd.1   up      1
-4      0.032                   host cepharm13
2       0.032                           osd.2   up      1
-5      0.032                   host cepharm14
3       0.032                           osd.3   up      1
-6      0.032                   host cepharm16
4       0.032                           osd.4   up      1
-22     0.18            chassis board1
-7      0.03                    host cepharm18
5       0.03                            osd.5   up      1
-8      0.03                    host cepharm19
6       0.03                            osd.6   up      1
-9      0.03                    host cepharm20
7       0.03                            osd.7   up      1
-10     0.03                    host cepharm21
8       0.03                            osd.8   up      1
-11     0.03                    host cepharm22
9       0.03                            osd.9   up      1
-12     0.03                    host cepharm23
10      0.03                            osd.10  up      1
-23     0.18            chassis board2
-13     0.03                    host cepharm25
11      0.03                            osd.11  up      1
-14     0.03                    host cepharm26
12      0.03                            osd.12  up      1
-15     0.03                    host cepharm27
13      0.03                            osd.13  up      1
-16     0.03                    host cepharm28
14      0.03                            osd.14  up      1
-17     0.03                    host cepharm29
15      0.03                            osd.15  up      1
-18     0.03                    host cepharm30
16      0.03                            osd.16  up      1

I am working on one of these boxes: http://www.ambedded.com.tw/pt_spec.php?P_ID=20141109001 So, each chassis is one 7-node board (with a shared 1gbe switch and shared electrical supply), and I figured each board is definitely a separate failure domain. Regards, --ck
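As an aside, treating each board as the failure domain means the CRUSH rule has to place replicas across chassis rather than hosts. A sketch of such a rule for a replicated pool (the rule name and ruleset number are made up for illustration):

```
rule replicate_across_boards {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type chassis
    step emit
}
```

The key line is `step chooseleaf firstn 0 type chassis`: it selects one leaf (OSD) under a distinct chassis bucket for each replica, so losing a whole 7-node board costs at most one copy of any object.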
Re: [ceph-users] NUMA zone_reclaim_mode
On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote: On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we should send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system… Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph? Cheers, Dan Thanks! sage
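The check Dan proposes really is tiny — a hedged sketch of the shape of such a warning (the function name and message text are invented, and the actual patch would read /proc/sys/vm/zone_reclaim_mode and route through the OSD's logging rather than return a string):

```python
# Warn when vm.zone_reclaim_mode is enabled, similar in spirit to
# MongoDB's startup check. `raw` is the sysctl file's contents.
def zone_reclaim_warning(raw):
    """Return a warning string if zone_reclaim_mode != 0, else None."""
    try:
        mode = int(raw.strip())
    except ValueError:
        return None  # unreadable: stay quiet rather than guess
    if mode == 0:
        return None
    return ("vm.zone_reclaim_mode = %d is known to hurt performance "
            "on NUMA machines; consider setting it to 0" % mode)

print(zone_reclaim_warning("1"))
print(zone_reclaim_warning("0"))  # -> None
```

Swallowing unparsable input keeps the warning from being "too annoying" on kernels or platforms where the sysctl does not exist.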
Re: [ceph-users] reset osd perf counters
perf reset on the admin socket. I'm not sure what version it went in to; you can check the release logs if it doesn't work on whatever you have installed. :) -Greg On Mon, Jan 12, 2015 at 2:26 PM, Shain Miley smi...@npr.org wrote: Is there a way to 'reset' the osd perf counters? The numbers for osd 73 though osd 83 look really high compared to the rest of the numbers I see here. I was wondering if I could clear the counters out, so that I have a fresh set of data to work with. [quoted `ceph osd perf` table snipped -- identical to the one in the original message]
Re: [ceph-users] cephfs modification time
Zheng, this looks like a kernel client issue to me, or else something funny is going on with the cap flushing and the timestamps (note how the reading client's ctime is set to an even second, while the mtime is ~.63 seconds later and matches what the writing client sees). Any ideas? -Greg On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote: Hi Gregory, $ uname -a Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux Kernel Client, using `mount -t ceph ...` core@coreos2 /var/run/systemd/system $ modinfo ceph filename: /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net alias: fs-ceph depends:libceph intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 core@coreos2 /var/run/systemd/system $ modinfo libceph filename: /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net depends:libcrc32c intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 ceph is installed on a ubuntu containers (same kernel): $ dpkg -l |grep ceph ii ceph 0.87-1trusty amd64distributed storage and file system ii ceph-common 0.87-1trusty amd64common utilities to mount and interact with a ceph storage cluster ii ceph-fs-common 0.87-1trusty amd64common utilities to mount and interact with a ceph file system ii ceph-fuse0.87-1trusty amd64FUSE-based client for the Ceph distributed file system ii ceph-mds 0.87-1trusty amd64metadata server for the ceph 
distributed file system ii libcephfs1 0.87-1trusty amd64 Ceph distributed file system client library ii python-ceph 0.87-1trusty amd64 Python libraries for the Ceph distributed filesystem Reproducing the error: at machine 1: core@coreos1 /var/lib/deis/store/logs $ > test.log core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log core@coreos1 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.637234229 + Birth: - at machine 2: core@coreos2 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.0 + Birth: - The change time is not updated, which makes some tail libraries not show new content until you force the change time to be updated, for example by running touch on the file. Some tools freeze and trigger other issues in the system. Tests, all in machine #2: FAILED - https://github.com/ActiveState/tail FAILED - /usr/bin/tail of a Google docker image running debian wheezy PASSED - /usr/bin/tail of a ubuntu 14.04 docker image PASSED - /usr/bin/tail of the coreos release 494.5.0 Tests in machine #1 (the same machine that is writing the file) all pass. On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote: What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc) Can you provide more details on exactly what the program is doing on which nodes?
-Greg On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote: The first 3 stat commands show blocks and size changing, but not the times; after a touch the times change and tail works. I saw some cephfs freezes related to it; it came back after touching the files. coreos2 logs # stat deis-router.log File: 'deis-router.log' Size: 148564 Blocks: 291 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511628780 Links: 1 Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root) Access: 2015-01-10 01:13:00.100582619
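For what it's worth, the pass/fail split across tail implementations above comes down to which stat fields each one compares between polls. A toy model of the two strategies (names are invented for illustration, not taken from any real tail library):

```python
from collections import namedtuple

Stat = namedtuple("Stat", "size mtime ctime")

def changed_by_mtime(old, new):
    # Tools keyed on size/mtime see the write.
    return new.size != old.size or new.mtime != old.mtime

def changed_by_ctime(old, new):
    # Tools keyed on ctime miss it when the reading client's
    # ctime is stale, as in the report above.
    return new.ctime != old.ctime

old = Stat(size=0, mtime=100.00, ctime=100.00)
new = Stat(size=2, mtime=166.63, ctime=100.00)  # ctime not propagated
print(changed_by_mtime(old, new))  # -> True
print(changed_by_ctime(old, new))  # -> False
```

Running `touch` on the file forces a fresh ctime, which is why it unblocks the ctime-keyed tools.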
[ceph-users] Ceph erasure-coded pool
All, I wish to experiment with erasure-coded pools in Ceph. I've got some questions:
1. Is FIREFLY a reasonable release to be using to try EC pools? When I look at various bits of development info, it appears that the work is complete in FIREFLY, but I thought I'd ask. :)
2. It looks, in FIREFLY, as if not all I/O operations can be performed on EC pools. I am trying to work with RBD clients, and I've run into some conflicting information... can RBDs run on EC pools directly, or is a caching tier required?
a. Assuming a cache tier is required, where might I read information on sizing the cache tier?
b. Looking through the issues, it appears there are some race conditions (e.g., #9285: http://tracker.ceph.com/issues/9285) for cache tiers in FIREFLY. Should I avoid cache tiers at this level? At what level, if any, are these addressed (I don't see commits in #9285, for example)?
3. When configuring EC pools, to specify the number of PGs, can I reasonably assume that I should use (K+M) instead of a replica count? So, for example, if I have 24 OSDs, and my EC profile has K=8 and M=4, then I should specify 200 (i.e., (24*100)/12) placement groups?
4. As I add OSDs, can I adjust the number of PGs?
Thanks in advance...
Don Doerner Quantum Corporation
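On question 3: substituting (K+M) for the replica count in the usual PG formula is indeed the standard reasoning; the common recommendation is then to round up to the next power of two. A sketch of the arithmetic (the power-of-two rounding is a rule of thumb, not a hard requirement):

```python
def ec_pg_count(num_osds, k, m, per_osd=100, round_pow2=True):
    """PG count for an EC pool: (osds * target PGs per OSD) / (k + m)."""
    raw = num_osds * per_osd // (k + m)
    if not round_pow2:
        return raw
    p = 1
    while p < raw:
        p *= 2
    return p

print(ec_pg_count(24, 8, 4, round_pow2=False))  # -> 200, as in the question
print(ec_pg_count(24, 8, 4))                    # -> 256, rounded up
```

So 200 is the number the formula gives for 24 OSDs with K=8, M=4; 256 would be the rounded figure.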
Re: [ceph-users] ceph on peta scale
however for geographic distributed datacentres specially when network flactuate how to handle that as i read it seems CEPH need big pipe of network Ceph isn't really suited for WAN-style distribution. Some users have high-enough and consistent-enough bandwidth (with low enough latency) to do it, but otherwise you probably want to use Ceph within the data centers and layer something else on top of it. Indeed. Ceph is not aware of WAN links. So reads and writes will be done remotely even if there is a copy locally. Bandwidth might not be much of an issue but latency certainly will be. Although bandwidth during a rebalance of data might also be problematic... Cheers, Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD Journal Best Practice
For the first choice: ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd I find that ceph-deploy will create the partitions automatically, and each partition is 5 GB by default. So the first and second choices are almost the same. Compared to a journal file on a filesystem, I prefer a block device, to get better performance. From: lidc...@redhat.com Date: 2015-01-12 12:35 To: ceph-us...@ceph.com Subject: SSD Journal Best Practice [quoted message snipped -- see the original post in this thread]
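On sizing those partitions: the documented rule of thumb is `osd journal size = 2 * (expected throughput * filestore max sync interval)`, and 5 GB (`osd journal size = 5120`) is just the ceph-deploy default. A sketch of the arithmetic, assuming a node with 10 spinners sharing the one 1.2 TB SSD:

```python
def journal_size_mb(throughput_mb_s, sync_interval_s=5):
    # osd journal size = 2 * expected throughput * filestore max sync interval
    # (5 s is the default filestore max sync interval)
    return 2 * throughput_mb_s * sync_interval_s

# e.g. a ~150 MB/s spinner needs roughly:
print(journal_size_mb(150))  # -> 1500 (MB)

# Carving the 1.2 TB SSD evenly across 10 OSDs leaves far more per
# journal than needed -- the real concern is that all 10 OSDs share
# the SSD's failure domain and its write bandwidth.
per_osd_mb = 1_200_000 // 10
print(per_osd_mb)  # -> 120000
```

So with default tunables the 5 GB default is already comfortably above the rule-of-thumb size for HDD-backed OSDs.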
Re: [ceph-users] Replace corrupt journal
On Sun, 11 Jan 2015, Sahlstrom, Claes wrote: Hi, I have a problem starting a couple of OSDs because of the journal being corrupt. Is there any way to replace the journal and keep the rest of the OSD intact? It is risky at best... I would not recommend it! The safe route is to wipe the OSD and let the cluster repair. -1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt 0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, bool*)' thread 7fb32df86900 time 2015-01-11 16:02:54.475276 os/FileJournal.cc: 1693: FAILED assert(0) Do you mind making a note that you saw this on this ticket: http://tracker.ceph.com/issues/6003 We see it periodically in QA but have never been able to track it down. It could also be caused by a hardware issue, so any information about whether the journal device appears damaged would be helpful. Thanks! sage
Re: [ceph-users] Replace corrupt journal
Thanks for the reply, I have had some more time to mess around with this now. I understand that the best thing is to allow it to rebuild the entire OSD, but since I am only using one replica and 2 of 3 machines had problems, I ended up in a bad situation. With OSDs down on 2 machines and one replica, I think I would lose data for certain if I rebuilt them from scratch. Luckily in my case there was no new data being written to the cluster at that time; I only use it as a NAS in my home lab. It did work out fine for me this time, but I guess anyone reading this should know it is not a recommended way to do things. I got confused because I was reusing a logical volume as the journal and I didn't wipe it properly before I used --mkjournal; after wiping it properly and then using --mkjournal again, the problem seems to be solved. My only outstanding issue now is one pg that remains inconsistent even after trying to do a repair; besides that, everything seems to be fine. I haven't dug too much into that yet; with only one replica I guess it is tricky to tell which of the copies is the broken one. I will add a note to that ticket; it happened when the power to the server was lost while replicating, and I think that is what made two journals corrupt. Cheers, Claes -----Original Message----- From: Sage Weil [mailto:s...@newdream.net] Sent: den 12 januari 2015 15:46 To: Sahlstrom, Claes Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] Replace corrupt journal On Sun, 11 Jan 2015, Sahlstrom, Claes wrote: Hi, I have a problem starting a couple of OSDs because of the journal being corrupt. Is there any way to replace the journal and keeping the rest of the OSD intact. It is risky at best... I would not recommend it! The safe route is to wipe the OSD and let the cluster repair.
-1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt 0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, bool*)' thread 7fb32df86900 time 2015-01-11 16:02:54.475276 os/FileJournal.cc: 1693: FAILED assert(0) Do you mind making a note that you saw this on this ticket: http://tracker.ceph.com/issues/6003 We see it periodically in QA but have never been able to track it down. It could also be caused by a hardware issue, so any information about whether the journal device appears damanged would be helpful. Thanks! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Caching
I have a couple of questions about caching: I have 5 VM hosts serving 20 VMs. I have 1 Ceph pool where the VM disks of those 20 VMs reside as RBD images.
1) Can I use multiple caching tiers on the same data pool? I would like to use a local SSD OSD on each VM host that can serve as an application-accelerator local cache for the VM disks. I can imagine data corruption if other VM hosts write to the same Ceph data pool while not using the same caching tier. I imagine no data corruption if I know no other VM host will access that Ceph object (VM disk / RBD image). I would need to flush the cache of that VM host when I shut down the VM on it, before I can start the VM on a different VM host. Or is Ceph perhaps smart enough that it would notify the above caching tier to evict a cached object when there is a change on that object made outside that caching tier?
2) Is the RBD cache useless for hosting Oracle databases? If Oracle does an O_SYNC and the RBD cache flushes on O_SYNC, then there would be nothing cached. Correct?
3) Would a caching tier be smart enough to flush dirty/modified objects on idle I/O? (When client I/O is not busy, Ceph would use that time to sync to the backend.) I know it will flush at a certain capacity (50%) or at a certain age (600 sec), but can it also flush based on a busy/idle percentage, or auto-magically/intelligently?
Thanks, Samuel Terburg Panther-IT BV www.panther-it.nl
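On question 3: the two thresholds mentioned correspond to the pool settings `cache_target_dirty_ratio` and `cache_min_flush_age`. A deliberately simplified model of the flush decision, framed the way the question frames it (the real tiering agent is more involved — for instance, the age setting acts as a minimum age gate for flushable objects rather than a standalone trigger):

```python
def should_flush(dirty_ratio, dirty_age_s,
                 target_dirty_ratio=0.5, min_flush_age_s=600):
    """Flush when the cache is dirty enough OR the object is old enough."""
    return dirty_ratio >= target_dirty_ratio or dirty_age_s >= min_flush_age_s

print(should_flush(0.6, 30))   # -> True  (capacity threshold hit)
print(should_flush(0.1, 700))  # -> True  (age threshold hit)
print(should_flush(0.1, 30))   # -> False (agent has nothing to do)
```

There is no busy/idle input in this model, which mirrors the situation being asked about: the agent is driven by capacity and age, not by client I/O load.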