[ceph-users] Replace corrupt journal

2015-01-12 Thread Sahlstrom, Claes
Hi,

I have a problem starting a couple of OSDs because of the journal being 
corrupt. Is there any way to replace the journal while keeping the rest of the 
OSD intact?

-1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past 
sequence 8188178 but header indicates the journal has committed up through 
8188206, journal is corrupt
 0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In 
function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, bool*)' 
thread 7fb32df86900 time 2015-01-11 16:02:54.475276
os/FileJournal.cc: 1693: FAILED assert(0)

I ended up in this situation when osd.9 on host orange went down, and then I 
had a power failure on the host purple which made 2 of my journals corrupt.
-3      6       host purple
4       1               osd.4   up      1
5       1               osd.5   down    0
7       2               osd.7   down    0
6       2               osd.6   up      1
-4      6       host orange
8       1               osd.8   up      1
9       1               osd.9   down    0

The filesystem was not in use by users, but it was replicating when the host 
went down and I figure that I still have the data on the OSD-disks, they are 
still mountable and the XFS-filesystem on them seems to be intact.

Thanks,
Claes


[ceph-users] Ceph MeetUp Berlin

2015-01-12 Thread Robert Sander
Hi,

the next MeetUp in Berlin takes place on January 26 at 18:00 CET.

Our host is Deutsche Telekom, they will hold a short presentation about
their OpenStack / CEPH based production system.

Please RSVP at http://www.meetup.com/Ceph-Berlin/events/218939774/

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein -- Registered office: Berlin





[ceph-users] error adding OSD to crushmap

2015-01-12 Thread Luis Periquito
Hi all,

I've been trying to add a few new OSDs; as I manage everything with
puppet, I was adding them manually via the CLI.

At one point it adds the OSD to the crush map using:

# ceph osd crush add 6 0.0 root=default

but I get
Error ENOENT: osd.6 does not exist.  create it before updating the crush map

If I read correctly this command should be the correct one to add the
OSD to the crush map...

is this a bug? I'm running the latest firefly 0.80.7.

thanks

PS: I just edited the crushmap, but it would make it a lot easier to do it
by the CLI commands...
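
For what it's worth, that ENOENT usually just means the OSD id has not been
registered in the cluster map yet; "ceph osd crush add" only places an existing
OSD. A minimal sketch of the usual order (the keyring path is an assumption,
not a full recipe):

  # ceph osd create                        # registers the next free id and prints it, e.g. 6
  # ceph auth add osd.6 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-6/keyring
  # ceph osd crush add 6 0.0 root=default  # now succeeds because osd.6 exists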


[ceph-users] NUMA zone_reclaim_mode

2015-01-12 Thread Dan Van Der Ster
(apologies if you receive this more than once... apparently I cannot reply to a 
1 year old message on the list).

Dear all,
I'd like to +10 this old proposal of Kyle's. Let me explain why...

A couple months ago we started testing a new use-case with radosgw --
this new user is writing millions of small files and has been causing
us some headaches. Since starting these tests, the relevant OSDs have
been randomly freezing for up to ~60s at a time. We have dedicated
servers for this use-case, so it doesn't affect our important RBD
users, and the OSDs always came back anyway (wrongly marked me
down...). So I didn't give this problem much attention, though I
guessed that we must be suffering from some network connectivity
problem.

But last week I started looking into this problem in more detail. With
increased debug_osd logs I saw that when these OSDs are getting marked
down, even the osd tick message is not printed for 30s. I also
correlated these outages with massive drops in cached memory -- it
looked as if an admin was running drop_caches on our live machines.
Here is what we saw:


https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, each OSD
has ~1 million files totalling around 40GB (~40kB objects). Compare
that with a different OSD host, one that's used for Cinder RBD volumes
(and doesn't suffer from the freezing OSD problem):


https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20
OSDs each hold around 100k files totalling ~400GB (~4MB objects).

Clearly the 10x increase in num files on the radosgw OSDs appears to
be causing a problem. In fact, since the servers are pretty idle most
of the time, it appears that the _scrubbing_ of these 20 million files
per server is causing the problem. It seems that scrubbing is creating
quite some memory pressure (via the inode cache, especially), so I
started testing different vfs_cache_pressure values (1,10,1000,1).
The only value that sort of helped was vfs_cache_pressure = 1, but
keeping all the inodes cached is a pretty extreme measure, and it
won't scale up when these OSDs are more full (they're only around 1%
full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and
this old thread. And I read a bit more, e.g.


http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB
communities regard this option as very bad for servers -- MongoDB even
prints a warning message at startup if zone_reclaim_mode is enabled.
And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is
disabled by default. The vm doc now says:

zone_reclaim_mode is disabled by default. For file servers or
workloads that benefit from having their data cached,
zone_reclaim_mode should be left disabled as the caching effect is
likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the
freezing OSD problem has gone away. Here's a plot of a server that had
zone_reclaim_mode set to zero late on Jan 9th:


https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all on the ceph command on one host, but
it doesn't appear to make a huge difference beyond disabling numa zone
reclaim.
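
For anyone who wants to check or change this on their own OSD hosts, a minimal
sketch (plain sysctl; the ceph-osd invocation on the last line is only an
illustration, most init scripts start the daemons for you):

  # cat /proc/sys/vm/zone_reclaim_mode          # 1 means zone reclaim is enabled
  # sysctl -w vm.zone_reclaim_mode=0            # disable it immediately
  # echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf     # keep it disabled across reboots
  # numactl --interleave=all ceph-osd -i 12 -c /etc/ceph/ceph.conf   # optional memory interleaving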

Moving forward, I think it would be good for Ceph to at least document
this behaviour, but better would be to also detect when
zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
line from the commit which disables it in the kernel is pretty wise,
IMHO: "On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to
detect this. Favour the common case and disable it by default."

Cheers, Dan


On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote:
 It seems that NUMA can be problematic for ceph-osd daemons in certain
 circumstances. Namely it seems that if a NUMA zone is running out of
 memory due to uneven allocation it is possible for a NUMA zone to
 enter reclaim mode when threads/processes are scheduled on a core in
 that zone and those processes request memory allocations greater
 than the zone's remaining memory. In order for the kernel to satisfy
 the memory allocation for those processes it needs to page out some of
 the contents of the contentious zone, which can have dramatic
 performance implications due to cache misses, etc. I see two ways an
 operator could alleviate these issues:

 Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
 ceph-osd daemons with numactl --interleave=all. This should probably
 be activated by a flag in 

Re: [ceph-users] NUMA zone_reclaim_mode

2015-01-12 Thread Sage Weil
On Mon, 12 Jan 2015, Dan Van Der Ster wrote:
 Moving forward, I think it would be good for Ceph to at least document
 this behaviour, but better would be to also detect when
 zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
 line from the commit which disables it in the kernel is pretty wise,
 IMHO: "On current machines and workloads it is often the case that
 zone_reclaim_mode destroys performance but not all users know how to
 detect this. Favour the common case and disable it by default."

Sounds good to me.  Do you mind submitting a patch that prints a warning 
from either FileStore::_detect_fs()?  That will appear in the local 
ceph-osd.NNN.log.

Alternatively, we should send something to the cluster log 
(osd->clog.warning() << ...) but if we go that route we need to be 
careful that the logger is up and running first, which (I think) rules out 
FileStore::_detect_fs().  It could go in OSD itself although that seems 
less clean since the recommendation probably doesn't apply when 
using a backend that doesn't use a file system...

Thanks!
sage


Re: [ceph-users] cephfs modification time

2015-01-12 Thread Gregory Farnum
What versions of all the Ceph pieces are you using? (Kernel
client/ceph-fuse, MDS, etc)

Can you provide more details on exactly what the program is doing on
which nodes?
-Greg

On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote:
 first 3 stat commands shows blocks and size changing, but not the times
 after a touch it changes and tail works

 I saw some cephfs freezes related to it, it came back after touching the files

 coreos2 logs # stat deis-router.log
   File: 'deis-router.log'
   Size: 148564 Blocks: 291        IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511628780  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2015-01-10 01:13:00.100582619 +
 Modify: 2015-01-10 01:13:00.100582619 +
 Change: 2015-01-10 01:13:00.0 +
  Birth: -
 coreos2 logs # stat deis-router.log
   File: 'deis-router.log'
   Size: 152633 Blocks: 299        IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511628780  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2015-01-10 01:13:00.100582619 +
 Modify: 2015-01-10 01:13:00.100582619 +
 Change: 2015-01-10 01:13:00.0 +
  Birth: -
 coreos2 logs # stat deis-router.log
   File: 'deis-router.log'
   Size: 155763 Blocks: 305        IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511628780  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2015-01-10 01:13:00.100582619 +
 Modify: 2015-01-10 01:13:00.100582619 +
 Change: 2015-01-10 01:13:00.0 +
  Birth: -

 coreos2 logs # touch deis-router.log

 coreos2 logs # stat deis-router.log
   File: 'deis-router.log'
   Size: 155763 Blocks: 305        IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511628780  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2015-01-10 01:13:46.961858103 +
 Modify: 2015-01-10 01:13:46.961858103 +
 Change: 2015-01-10 01:13:46.0 +
  Birth: -

 On Fri, Jan 9, 2015 at 11:11 PM, Lorieri lori...@gmail.com wrote:
 Hi,

 I have a program that tails a file and this file is create on another machine

 some tail programs do not work because the modification time is not
 updated on the remote machines

 I've find this old thread
 http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/11001

 it mentions the problem and suggest ntp sync

 I tried to re-sync ntp and restart the ceph cluster, but the issue persists

 do you know if it is possible to avoid this behavior ?

 thanks
 -lorieri


[ceph-users] the performance issue for cache pool

2015-01-12 Thread lidc...@redhat.com
Hi everyone:

I used writeback mode for the cache pool:
 
  ceph osd tier add sas ssd
  ceph osd tier add sas ssd
  ceph osd tier cache-mode ssd writeback
  ceph osd tier set-overlay sas ssd

and I also set the dirty ratio and full ratio:

ceph osd pool set ssd cache_target_dirty_ratio .4 
ceph osd pool set ssd cache_target_full_ratio .8 
 
The capacity of the ssd cache pool is 4T.
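
One thing worth double-checking: as far as I know the dirty/full ratios are only
evaluated relative to target_max_bytes (or target_max_objects), so if neither is
set the tiering agent has no reference point for when to flush. A hedged sketch
for a 4T cache pool (the exact values are just examples):

  ceph osd pool set ssd target_max_bytes 4398046511104    # 4 TiB, gives the ratios a reference
  ceph osd pool set ssd cache_min_flush_age 600            # optional: don't flush very young objects
  ceph osd pool set ssd cache_min_evict_age 1800           # optional: don't evict very young objects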

I used fio to test performance:
fio -filename=/dev/rbd0 -direct=1 -iodepth 32 -thread -rw=randwrite 
-ioengine=libaio -bs=16M -size=2000G -group_reporting -name=mytest

At the beginning the performance is very good, but after half an hour, I find that when 
the hot cache pool begins flushing dirty objects, the performance of rados is 
unstable, ranging from 87851 kB/s to 860 MB/s.

Do you have any tuning parameters to get more stable performance?

Thanks.

2014-12-23 22:46:24.844730 mon.0 [INF] pgmap v24101: 6144 pgs: 6144 
active+clean; 1246 GB data, 4012 GB used, 45109 GB / 49121 GB avail; 680 MB/s 
wr, 1007 op/s
2014-12-23 22:46:27.851431 mon.0 [INF] pgmap v24102: 6144 pgs: 6144 
active+clean; 1246 GB data, 4012 GB used, 45109 GB / 49121 GB avail; 161 MB/s 
wr, 299 op/s
2014-12-23 22:46:28.883866 mon.0 [INF] pgmap v24103: 6144 pgs: 6144 
active+clean; 1247 GB data, 4015 GB used, 45106 GB / 49121 GB avail; 308 MB/s 
wr, 1065 op/s
2014-12-23 22:46:29.885914 mon.0 [INF] pgmap v24104: 6144 pgs: 6144 
active+clean; 1247 GB data, 4016 GB used, 45105 GB / 49121 GB avail; 701 MB/s 
wr, 1621 op/s
2014-12-23 22:46:32.842955 mon.0 [INF] pgmap v24105: 6144 pgs: 6144 
active+clean; 1247 GB data, 4016 GB used, 45105 GB / 49121 GB avail; 116 MB/s 
wr, 160 op/s
2014-12-23 22:46:33.863964 mon.0 [INF] pgmap v24106: 6144 pgs: 6144 
active+clean; 1248 GB data, 4021 GB used, 45100 GB / 49121 GB avail; 344 MB/s 
wr, 923 op/s
2014-12-23 22:46:34.861011 mon.0 [INF] pgmap v24107: 6144 pgs: 6144 
active+clean; 1248 GB data, 4021 GB used, 45100 GB / 49121 GB avail; 706 MB/s 
wr, 1564 op/s
2014-12-23 22:46:38.176885 mon.0 [INF] pgmap v24108: 6144 pgs: 6144 
active+clean; 1249 GB data, 4024 GB used, 45097 GB / 49121 GB avail; 222 MB/s 
wr, 938 op/s
2014-12-23 22:46:39.177233 mon.0 [INF] pgmap v24109: 6144 pgs: 6144 
active+clean; 1250 GB data, 4026 GB used, 45095 GB / 49121 GB avail; 427 MB/s 
wr, 1292 op/s
2014-12-23 22:46:42.842279 mon.0 [INF] pgmap v24110: 6144 pgs: 6144 
active+clean; 1250 GB data, 4026 GB used, 45095 GB / 49121 GB avail; 320 MB/s 
wr, 570 op/s
2014-12-23 22:46:43.872017 mon.0 [INF] pgmap v24111: 6144 pgs: 6144 
active+clean; 1251 GB data, 4030 GB used, 45090 GB / 49121 GB avail; 405 MB/s 
wr, 992 op/s
2014-12-23 22:46:44.862873 mon.0 [INF] pgmap v24112: 6144 pgs: 6144 
active+clean; 1251 GB data, 4030 GB used, 45090 GB / 49121 GB avail; 729 MB/s 
wr, 1755 op/s
2014-12-23 22:46:47.847813 mon.0 [INF] pgmap v24113: 6144 pgs: 6144 
active+clean; 1251 GB data, 4031 GB used, 45090 GB / 49121 GB avail; 2053 kB/s 
wr, 135 op/s
2014-12-23 22:46:48.857285 mon.0 [INF] pgmap v24114: 6144 pgs: 6144 
active+clean; 1252 GB data, 4033 GB used, 45087 GB / 49121 GB avail; 272 MB/s 
wr, 433 op/s
2014-12-23 22:46:49.871775 mon.0 [INF] pgmap v24115: 6144 pgs: 6144 
active+clean; 1252 GB data, 4034 GB used, 45087 GB / 49121 GB avail; 535 MB/s 
wr, 586 op/s
2014-12-23 22:46:52.842098 mon.0 [INF] pgmap v24116: 6144 pgs: 6144 
active+clean; 1252 GB data, 4033 GB used, 45088 GB / 49121 GB avail; 3074 kB/s 
wr, 113 op/s
2014-12-23 22:46:53.845398 mon.0 [INF] pgmap v24117: 6144 pgs: 6144 
active+clean; 1254 GB data, 4037 GB used, 45084 GB / 49121 GB avail; 342 MB/s 
wr, 571 op/s
2014-12-23 22:46:57.844137 mon.0 [INF] pgmap v24118: 6144 pgs: 6144 
active+clean; 1254 GB data, 4037 GB used, 45084 GB / 49121 GB avail; 302 MB/s 
wr, 577 op/s
2014-12-23 22:46:58.848028 mon.0 [INF] pgmap v24119: 6144 pgs: 6144 
active+clean; 1255 GB data, 4039 GB used, 45082 GB / 49121 GB avail; 319 MB/s 
wr, 897 op/s
2014-12-23 22:47:02.844724 mon.0 [INF] pgmap v24120: 6144 pgs: 6144 
active+clean; 1255 GB data, 4039 GB used, 45082 GB / 49121 GB avail; 327 MB/s 
wr, 856 op/s
2014-12-23 22:47:03.850795 mon.0 [INF] pgmap v24121: 6144 pgs: 6144 
active+clean; 1256 GB data, 4043 GB used, 45078 GB / 49121 GB avail; 297 MB/s 
wr, 887 op/s
2014-12-23 22:47:08.169046 mon.0 [INF] pgmap v24122: 6144 pgs: 6144 
active+clean; 1256 GB data, 4045 GB used, 45076 GB / 49121 GB avail; 318 MB/s 
wr, 830 op/s
2014-12-23 22:47:09.169302 mon.0 [INF] pgmap v24123: 6144 pgs: 6144 
active+clean; 1257 GB data, 4046 GB used, 45075 GB / 49121 GB avail; 133 MB/s 
wr, 257 op/s
2014-12-23 22:47:12.844073 mon.0 [INF] pgmap v24124: 6144 pgs: 6144 
active+clean; 1257 GB data, 4046 GB used, 45075 GB / 49121 GB avail; 65702 kB/s 
wr, 124 op/s
2014-12-23 22:47:13.845286 mon.0 [INF] pgmap v24125: 6144 pgs: 6144 
active+clean; 1257 GB data, 4047 GB used, 45074 GB / 49121 GB avail; 142 MB/s 
wr, 284 op/s
2014-12-23 22:47:14.846753 mon.0 [INF] pgmap v24126: 6144 pgs: 6144 
active+clean; 1257 GB data, 4047 GB used, 45074 GB / 49121 GB avail; 461 

[ceph-users] unsubscribe

2015-01-12 Thread Don Doerner
unsubscribe

Regards,

-don-



[ceph-users] Problem with Rados gateway

2015-01-12 Thread Walter Valenti
Scenario:
Openstack Juno RDO on Centos7.
Ceph version: Giant.

On CentOS 7 the old fastcgi module is no longer available,
but there is mod_fcgid.



The apache VH is the following:
<VirtualHost *:8080>
ServerName rdo-ctrl01
DocumentRoot /var/www/radosgw
RewriteEngine On
RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
<Directory /var/www/radosgw>
Options +ExecCGI
AllowOverride All
SetHandler fcgid-script
Order allow,deny
Allow from all
AuthBasicAuthoritative Off
</Directory>
AllowEncodedSlashes On
ErrorLog /var/log/httpd/error.log
CustomLog /var/log/httpd/access.log combined
ServerSignature Off
</VirtualHost>


On /var/www/radosgw there's the cgi file s3gw.fcgi:
#!/bin/sh
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway -d 
--debug-rgw 20 --debug-ms 1


For the configuration I've followed this documentation:
http://docs.ceph.com/docs/next/radosgw/config/
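
For comparison, the ceph.conf section that this documentation pairs with such a
vhost usually looks roughly like the following; the paths and the print-continue
setting are assumptions to check against your own setup, not a known-good config:

  [client.radosgw.gateway]
      host = rdo-ctrl01
      keyring = /etc/ceph/ceph.client.radosgw.keyring
      rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
      rgw print continue = false
      log file = /var/log/ceph/client.radosgw.gateway.log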

When I try to access the object storage I get the following errors:

1) Apache VH error:
[Wed Jan 07 13:15:22.029411 2015] [fcgid:info] [pid 2051] mod_fcgid: server 
rdo-ctrl01:/var/www/radosgw/s3gw.fcgi(28527) started
2015-01-07 13:15:22.046644 7ff16e240880  0 ceph version 0.87 
(c51c8f9d80fa4e0168aa52685b8de40e42758578), process radosgw, pid 28527
2015-01-07 13:15:22.053673 7ff16e240880  1 -- :/0 messenger.start
2015-01-07 13:15:22.054783 7ff16e240880  1 -- :/1028527 -- 
163.162.90.120:6789/0 -- auth(proto 0 40 bytes epoch 0) v1 -- ?+0 0x11d9100 con 
0x11a0870
2015-01-07 13:15:22.055339 7ff16e238700  1 -- 163.162.90.120:0/1028527 learned 
my addr 163.162.90.120:0/1028527
2015-01-07 13:15:22.056425 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 1  mon_map magic: 0 v1  200+0+0 (3839442293 
0 0) 0x7ff148000ab0 con 0x11a0870
2015-01-07 13:15:22.056547 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
33+0+0 (3991100068 0 0) 0x7ff148000f70 con 0x11a0870
2015-01-07 13:15:22.056900 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 -- 
163.162.90.120:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 
0x7ff14c0012e0 con 0x11a0870
2015-01-07 13:15:22.057505 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
222+0+0 (1145796146 0 0) 0x7ff148000f70 con 0x11a0870
2015-01-07 13:15:22.057768 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 -- 
163.162.90.120:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- ?+0 
0x7ff14c001ca0 con 0x11a0870
2015-01-07 13:15:22.058496 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
425+0+0 (2903986998 0 0) 0x7ff148001200 con 0x11a0870
2015-01-07 13:15:22.058694 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 -- 
163.162.90.120:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x11d94c0 con 
0x11a0870
2015-01-07 13:15:22.058843 7ff16e240880  1 -- 163.162.90.120:0/1028527 -- 
163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x11d91d0 con 0x11a0870
2015-01-07 13:15:22.058934 7ff16e240880  1 -- 163.162.90.120:0/1028527 -- 
163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x11d9ab0 con 0x11a0870
2015-01-07 13:15:22.059214 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 5  mon_map magic: 0 v1  200+0+0 (3839442293 
0 0) 0x7ff148001130 con 0x11a0870
2015-01-07 13:15:22.059140 7ff1567fc700  2 
RGWDataChangesLog::ChangesRenewThread: start
2015-01-07 13:15:22.059737 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0 
(1877860257 0 0) 0x7ff148001410 con 0x11a0870
2015-01-07 13:15:22.059869 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 7  osd_map(52..52 src has 1..52) v3  
5987+0+0 (3066791464 0 0) 0x7ff148002d50 con 0x11a0870
2015-01-07 13:15:22.060250 7ff16e240880 20 get_obj_state: rctx=0x119c2f0 
obj=.rgw.root:default.region state=0x119dba8 s-prefetch_data=0
2015-01-07 13:15:22.060302 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 8  mon_subscribe_ack(300s) v1  20+0+0 
(1877860257 0 0) 0x7ff148001130 con 0x11a0870
2015-01-07 13:15:22.060325 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 9  osd_map(52..52 src has 1..52) v3  
5987+0+0 (3066791464 0 0) 0x7ff1480046f0 con 0x11a0870
2015-01-07 13:15:22.060333 7ff16e240880 10 cache get: 
name=.rgw.root+default.region : miss
2015-01-07 13:15:22.060342 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 == 
mon.0 163.162.90.120:6789/0 10  mon_subscribe_ack(300s) v1  20+0+0 
(1877860257 0 0) 0x7ff148004bb0 con 0x11a0870
2015-01-07 13:15:22.060444 7ff16e240880  1 -- 163.162.90.120:0/1028527 -- 
163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=53}) v2 -- ?+0 
0x119eaf0 con 0x11a0870
2015-01-07 13:15:22.060805 

Re: [ceph-users] rbd directory listing performance issues

2015-01-12 Thread Shain Miley
Hi,
I am just wondering if anyone has any thoughts on the questions below...I would 
like to order some additional hardware ASAP...and the order that I place may 
change depending on the feedback that I receive.

Thanks again,

Shain

Sent from my iPhone

 On Jan 9, 2015, at 2:45 PM, Shain Miley smi...@npr.org wrote:
 
 Although it seems like having a regularly scheduled cron job to do a 
 recursive directory listing may be ok for us as a bit of a work around...I am 
 still in the processes of trying to improve performance.
 
 A few other questions have come up as a result.
 
 a)I am in the process of looking at specs for a new rbd 'headnode' that will 
 be used to mount our 100TB rbd image.  At some point in the future we may 
 look into the performance, and multi client access that cephfs could 
 offer...is there any reason that I would not be able to use this new server 
 as both an rbd client and an mds server (assuming the hardware is good 
 enough)?  I know that some cluster functions should not and cannot be mixed 
 on the same server...is this by any chance one of them?
 
 b)Currently the 100TB rbd image is acting as one large repository for our 
 archive...this will only grow over time.   I understand that ceph is pool 
 based...however I am wondering if I would somehow see any better per rbd 
 image performance...if for example...instead of having 1 x 100TB rbd 
 image...I had 4 x 25TB rbd images (since we really could split these up based 
 on our internal groups).
 
 c)Would adding a few ssd drives (in the right quantity) to each node help out 
 with reads as well as writes?
 
 d)I am a bit confused about how to enable the rbd cache option on the 
 client...is this change something that only needs to be made to the ceph.conf 
 file on the rbd kernel client server...or do the mds and osd servers need the 
 ceph.conf file modified as well and their services restarted?
 
 Other options that I might be looking into going forward are moving some of 
 this data (the data actually needed by our php apps) to rgw...although that 
 option adds some more complexity and unfamiliarity for our users.
 
 Thanks again for all the help so far.
 
 Shain
 
 On 01/07/2015 03:40 PM, Shain Miley wrote:
 Just to follow up on this thread, the main reason that the rbd directory 
 listing latency was an issue for us,  was that we were seeing a large amount 
 of IO delay in a PHP app that reads from that rbd image.
 
 It occurred to me (based on Robert's cache_dir suggestion below) that maybe 
 doing a recursive find or a recursive directory listing inside the one 
 folder in question might speed things up.
 
 After doing the recursive find...the directory listing seems much faster and 
 the responsiveness of the PHP app has increased as well.
 
 Hopefully nothing else will need to be done here, however it seems that 
 worst case...a daily or weekly cronjob that traverses the directory tree in 
 that folder might be all we need.
 
 Thanks again for all the help.
 
 Shain
 
 
 
 Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
 smi...@npr.org | 202.513.3649
 
 
 From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Shain 
 Miley [smi...@npr.org]
 Sent: Tuesday, January 06, 2015 8:16 PM
 To: Christian Balzer; ceph-us...@ceph.com
 Subject: Re: [ceph-users] rbd directory listing performance issues
 
 Christian,
 
 Each of the OSD server nodes is running on Dell R-720xd's with 64 GB of 
 RAM.
 
 We have 107 OSD's so I have not checked all of them..however the ones I have 
 checked with xfs_db, have shown anywhere from 1% to 4% fragmentation.
 
 I'll try to upgrade the client server to 32 or 64 GB of ram at some point 
 soon...however at this point all the tuning that I have done has not yielded 
 all that much in terms of results.
 
 It maybe a simple fact that I need to look into adding some SSD's, and the 
 overall bottleneck here are the 4TB 7200 rpm disks we are using.
 
 In general, when looking at the graphs in Calamari, we see around 20ms 
 latency (await) for our OSD's however there are lots of times where we see 
 (via the graphs) spikes of 250ms to 400ms as well.
 
 Thanks again,
 
 Shain
 
 
 Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
 smi...@npr.org | 202.513.3649
 
 
 From: Christian Balzer [ch...@gol.com]
 Sent: Tuesday, January 06, 2015 7:34 PM
 To: ceph-us...@ceph.com
 Cc: Shain Miley
 Subject: Re: [ceph-users] rbd directory listing performance issues
 
 Hello,
 
 On Tue, 6 Jan 2015 15:29:50 + Shain Miley wrote:
 
 Hello,
 
 We currently have a 12 node (3 monitor+9 OSD) ceph cluster, made up of
 107 x 4TB drives formatted with xfs. The cluster is running ceph version
 0.80.7:
 I assume journals on the same HDD then.
 
 How much memory per node?
 
 [snip]
 A while back I created an 80 TB rbd image to be used as an archive
 repository for some of our audio and video 

Re: [ceph-users] ceph on peta scale

2015-01-12 Thread Gregory Farnum
On Mon, Jan 12, 2015 at 3:55 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:
 Thanks Greg. No, I am more into a large-scale RADOS system, not the filesystem.

  However, for geographically distributed datacentres, especially when the
  network fluctuates, how do we handle that? From what I have read, it seems
  Ceph needs a big network pipe.

Ceph isn't really suited for WAN-style distribution. Some users have
high-enough and consistent-enough bandwidth (with low enough latency)
to do it, but otherwise you probably want to use Ceph within the data
centers and layer something else on top of it.
-Greg


 /Zee

 On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se
 wrote:
  I just finished configuring ceph up to 100 TB with openstack ... Since we
  are also using Lustre in our HPC machines, I am just wondering what the
  bottleneck is in ceph going to peta scale like Lustre.

  Any ideas? Has someone tried it?

 If you're talking about people building a petabyte Ceph system, there
 are *many* who run clusters of that size. If you're talking about the
 Ceph filesystem as a replacement for Lustre at that scale, the concern
 is less about the raw amount of data and more about the resiliency of
 the current code base at that size...but if you want to try it out and
 tell us what problems you run into we will love you forever. ;)
 (The scalable file system use case is what actually spawned the Ceph
 project, so in theory there shouldn't be any serious scaling
 bottlenecks. In practice it will depend on what kind of metadata
 throughput you need because the multi-MDS stuff is improving but still
 less stable.)
 -Greg




 --

 Regards

 Zeeshan Ali Shah
 System Administrator - PDC HPC
 PhD researcher (IT security)
 Kungliga Tekniska Hogskolan
 +46 8 790 9115
 http://www.pdc.kth.se/members/zashah


[ceph-users] reset osd perf counters

2015-01-12 Thread Shain Miley

Is there a way to 'reset' the osd perf counters?

The numbers for osd 73 though osd 83 look really high compared to the 
rest of the numbers I see here.


I was wondering if I could clear the counters out, so that I have a 
fresh set of data to work with.



root@cephmount1:/var/log/samba# ceph osd perf
osdid fs_commit_latency(ms) fs_apply_latency(ms)
    0        0       45
    1        0       14
    2        0       47
    3        0       25
    4        1       44
    5        1        2
    6        1        2
    7        0       39
    8        0       32
    9        0       34
   10        2      186
   11        0       68
   12        1        1
   13        0       34
   14        0        1
   15        2       37
   16        0       23
   17        0       28
   18        0       26
   19        0       22
   20        0        2
   21        2       24
   22        0       33
   23        0        1
   24        3       98
   25        2       70
   26        0        1
   27        3       99
   28        0        2
   29        2      101
   30        2       72
   31        2       81
   32        3      112
   33        3       94
   34        4      152
   35        0       56
   36        0        2
   37        2       58
   38        0        1
   39        0        3
   40        0        2
   41        0        2
   42        1        1
   43        0        2
   44        1       44
   45        0        2
   46        0        1
   47        3       85
   48        0        1
   49        2       75
   50        4      398
   51        3      115
   52        0        1
   53        2       47
   54        6      290
   55        5      153
   56        7      453
   57        2       66
   58        1        1
   59        5      196
   60        0        0
   61        0       93
   62        0        9
   63        0        1
   64        0        1
   65        0        4
   66        0        1
   67        0       18
   68        0       16
   69        0       81
   70        0       70
   71        0        0
   72        0        1
   73       74     1217
   74        0        1
   75       64     1238
   76       92     1248
   77        0        1
   78        0        1
   79      109     1333
   80       68     1451
   81       66     1192
   82       95     1215
   83       81     1331
   84        3       56
   85        3       65
   86        0        1
   87        3       55
   88        4       42
   89        3       59
   90        4       52
   91        2       34
   92        0       17
   93        0        1
   94        0

Re: [ceph-users] ceph on peta scale

2015-01-12 Thread Zeeshan Ali Shah
Thanks Greg. No, I am more into a large-scale RADOS system, not the filesystem.

However, for geographically distributed datacentres, especially when the
network fluctuates, how do we handle that? From what I have read, it seems
Ceph needs a big network pipe.

/Zee

On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se
 wrote:
  I just finished configuring ceph up to 100 TB with openstack ... Since we
  are also using Lustre in our HPC machines, I am just wondering what the
  bottleneck is in ceph going to peta scale like Lustre.

  Any ideas? Has someone tried it?

 If you're talking about people building a petabyte Ceph system, there
 are *many* who run clusters of that size. If you're talking about the
 Ceph filesystem as a replacement for Lustre at that scale, the concern
 is less about the raw amount of data and more about the resiliency of
 the current code base at that size...but if you want to try it out and
 tell us what problems you run into we will love you forever. ;)
 (The scalable file system use case is what actually spawned the Ceph
 project, so in theory there shouldn't be any serious scaling
 bottlenecks. In practice it will depend on what kind of metadata
 throughput you need because the multi-MDS stuff is improving but still
 less stable.)
 -Greg




-- 

Regards

Zeeshan Ali Shah
System Administrator - PDC HPC
PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah


[ceph-users] SSD Journal Best Practice

2015-01-12 Thread lidc...@redhat.com
Hi everyone:
  I plan to use an SSD journal to improve performance.
  I have one 1.2T SSD disk per server.

  What is the best practice for the SSD journal?
  There are three choices for deploying the SSD journal:
  1. all OSDs use the same SSD partition
  ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd
  2. each OSD uses its own SSD partition
  ceph-deploy osd create ceph-node:sdb:/dev/ssd1 ceph-node:sdc:/dev/ssd2
  3. each OSD uses a file for the journal, with the file on the SSD disk
  ceph-deploy osd create ceph-node:sdb:/mnt/ssd/ssd1 ceph-node:sdc:/mnt/ssd/ssd2

  Any suggestions?
   Thanks.



[ceph-users] How to get ceph-extras packages for centos7

2015-01-12 Thread lei shi
Hi experts,
Could you guys guide me on how to get the ceph-extras packages for CentOS 7?
I am trying to install Giant on CentOS 7 manually; however, the latest
extras packages in the repository are only for CentOS 6.4.
BTW, is qemu aware of Giant? Should I get a dedicated build for Giant?
Thanks in advance

--
Ray Shi


Re: [ceph-users] CRUSH question - failing to rebalance after failure test

2015-01-12 Thread Christopher Kunz
Hi,

[redirecting back to list]
 Oh, it could be that... can you include the output from 'ceph osd tree'?  
 That's a more concise view that shows up/down, weight, and in/out.
 
 Thanks!
 sage
 

root@cepharm17:~# ceph osd tree
# id    weight  type name               up/down reweight
-1      0.52    root default
-21     0.16        chassis board0
-2      0.032           host cepharm11
0       0.032               osd.0       up      1
-3      0.032           host cepharm12
1       0.032               osd.1       up      1
-4      0.032           host cepharm13
2       0.032               osd.2       up      1
-5      0.032           host cepharm14
3       0.032               osd.3       up      1
-6      0.032           host cepharm16
4       0.032               osd.4       up      1
-22     0.18        chassis board1
-7      0.03            host cepharm18
5       0.03                osd.5       up      1
-8      0.03            host cepharm19
6       0.03                osd.6       up      1
-9      0.03            host cepharm20
7       0.03                osd.7       up      1
-10     0.03            host cepharm21
8       0.03                osd.8       up      1
-11     0.03            host cepharm22
9       0.03                osd.9       up      1
-12     0.03            host cepharm23
10      0.03                osd.10      up      1
-23     0.18        chassis board2
-13     0.03            host cepharm25
11      0.03                osd.11      up      1
-14     0.03            host cepharm26
12      0.03                osd.12      up      1
-15     0.03            host cepharm27
13      0.03                osd.13      up      1
-16     0.03            host cepharm28
14      0.03                osd.14      up      1
-17     0.03            host cepharm29
15      0.03                osd.15      up      1
-18     0.03            host cepharm30
16      0.03                osd.16      up      1

I am working on one of these boxes:
http://www.ambedded.com.tw/pt_spec.php?P_ID=20141109001
So, each chassis is one 7-node board (with a shared 1gbe switch and
shared electrical supply), and I figured each board is definitely a
separate failure domain.
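
If the goal is for CRUSH to keep replicas on separate boards, a minimal sketch of
expressing that from the CLI rather than by hand-editing the crushmap (the rule
name is a placeholder, and the rule id has to be read from the dump):

  # ceph osd crush rule create-simple per-chassis default chassis
  # ceph osd crush rule dump                       # note the id of the new rule
  # ceph osd pool set rbd crush_ruleset <rule-id>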

Regards,

--ck


Re: [ceph-users] NUMA zone_reclaim_mode

2015-01-12 Thread Dan Van Der Ster

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote:
Moving forward, I think it would be good for Ceph to at least document
this behaviour, but better would be to also detect when
zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
line from the commit which disables it in the kernel is pretty wise,
IMHO: "On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to
detect this. Favour the common case and disable it by default."

Sounds good to me.  Do you mind submitting a patch that prints a warning
from either FileStore::_detect_fs()?  That will appear in the local
ceph-osd.NNN.log.

Alternatively, we should send something to the cluster log
(osd->clog.warning() << ...) but if we go that route we need to be
careful that the logger is up and running first, which (I think) rules out
FileStore::_detect_fs().  It could go in OSD itself although that seems
less clean since the recommendation probably doesn't apply when
using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB 
already solved the heuristic:

https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp

It’s licensed as AGPLv3 -- do you already know if we can borrow such code into 
Ceph?
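
For what it's worth, the check itself is tiny; expressed as shell rather than the
AGPL C++, it is essentially the following (the warning text is mine, not MongoDB's):

  if [ -f /proc/sys/vm/zone_reclaim_mode ] && \
     [ "$(cat /proc/sys/vm/zone_reclaim_mode)" != "0" ]; then
      echo "WARNING: vm.zone_reclaim_mode is enabled; this can cause large latencies" >&2
  fi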

Cheers, Dan


Thanks!
sage



Re: [ceph-users] reset osd perf counters

2015-01-12 Thread Gregory Farnum
perf reset on the admin socket. I'm not sure what version it went in
to; you can check the release logs if it doesn't work on whatever you
have installed. :)
-Greg
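
In case it helps, a sketch of what that looks like on the host running the noisy
OSDs; the socket path and the 'all' argument are the usual defaults, but treat
them as assumptions to verify on your release:

  # ceph daemon osd.73 perf reset all
  # ceph --admin-daemon /var/run/ceph/ceph-osd.73.asok perf reset all   # same thing, explicit socket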


On Mon, Jan 12, 2015 at 2:26 PM, Shain Miley smi...@npr.org wrote:
 Is there a way to 'reset' the osd perf counters?

 The numbers for osd 73 though osd 83 look really high compared to the rest
 of the numbers I see here.

 I was wondering if I could clear the counters out, so that I have a fresh
 set of data to work with.


 root@cephmount1:/var/log/samba# ceph osd perf
 osdid fs_commit_latency(ms) fs_apply_latency(ms)
     0        0       45
     1        0       14
     2        0       47
     3        0       25
     4        1       44
     5        1        2
     6        1        2
     7        0       39
     8        0       32
     9        0       34
    10        2      186
    11        0       68
    12        1        1
    13        0       34
    14        0        1
    15        2       37
    16        0       23
    17        0       28
    18        0       26
    19        0       22
    20        0        2
    21        2       24
    22        0       33
    23        0        1
    24        3       98
    25        2       70
    26        0        1
    27        3       99
    28        0        2
    29        2      101
    30        2       72
    31        2       81
    32        3      112
    33        3       94
    34        4      152
    35        0       56
    36        0        2
    37        2       58
    38        0        1
    39        0        3
    40        0        2
    41        0        2
    42        1        1
    43        0        2
    44        1       44
    45        0        2
    46        0        1
    47        3       85
    48        0        1
    49        2       75
    50        4      398
    51        3      115
    52        0        1
    53        2       47
    54        6      290
    55        5      153
    56        7      453
    57        2       66
    58        1        1
    59        5      196
    60        0        0
    61        0       93
    62        0        9
    63        0        1
    64        0        1
    65        0        4
    66        0        1
    67        0       18
    68        0       16
    69        0       81
    70        0       70
    71        0        0
    72        0        1
    73       74     1217
    74        0        1
    75       64     1238
    76       92     1248
    77        0        1
    78        0        1
    79      109     1333
    80       68     1451
    81       66     1192
    82       95     1215
    83       81     1331
    84        3       56
    85        3       65
    86        0        1
    87        3       55
    88

Re: [ceph-users] cephfs modification time

2015-01-12 Thread Gregory Farnum
Zheng, this looks like a kernel client issue to me, or else something
funny is going on with the cap flushing and the timestamps (note how
the reading client's ctime is set to an even second, while the mtime
is ~.63 seconds later and matches what the writing client sees). Any
ideas?
-Greg

On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote:
 Hi Gregory,


 $ uname -a
 Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64
 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux


 Kernel Client, using  `mount -t ceph ...`


 core@coreos2 /var/run/systemd/system $ modinfo ceph
 filename:   /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko
 license:GPL
 description:Ceph filesystem for Linux
 author: Patience Warnick patie...@newdream.net
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 alias:  fs-ceph
 depends:libceph
 intree: Y
 vermagic:   3.17.7+ SMP mod_unload
 signer: Magrathea: Glacier signing key
 sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
 sig_hashalgo:   sha256

 core@coreos2 /var/run/systemd/system $ modinfo libceph
 filename:   /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko
 license:GPL
 description:Ceph filesystem for Linux
 author: Patience Warnick patie...@newdream.net
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 depends:libcrc32c
 intree: Y
 vermagic:   3.17.7+ SMP mod_unload
 signer: Magrathea: Glacier signing key
 sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
 sig_hashalgo:   sha256



 ceph is installed on ubuntu containers (same kernel):

 $ dpkg -l |grep ceph

 ii  ceph 0.87-1trusty
 amd64distributed storage and file system
 ii  ceph-common  0.87-1trusty
 amd64common utilities to mount and interact with a ceph
 storage cluster
 ii  ceph-fs-common   0.87-1trusty
 amd64common utilities to mount and interact with a ceph file
 system
 ii  ceph-fuse0.87-1trusty
 amd64FUSE-based client for the Ceph distributed file system
 ii  ceph-mds 0.87-1trusty
 amd64metadata server for the ceph distributed file system
 ii  libcephfs1   0.87-1trusty
 amd64Ceph distributed file system client library
 ii  python-ceph  0.87-1trusty
 amd64Python libraries for the Ceph distributed filesystem



 Reproducing the error:

 at machine 1:
 core@coreos1 /var/lib/deis/store/logs $ > test.log
 core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log
 core@coreos1 /var/lib/deis/store/logs $ stat test.log
   File: 'test.log'
   Size: 2 Blocks: 1  IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511629882  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/core)
 Access: 2015-01-12 20:05:03.0 +
 Modify: 2015-01-12 20:06:09.637234229 +
 Change: 2015-01-12 20:06:09.637234229 +
  Birth: -

 at machine 2:
 core@coreos2 /var/lib/deis/store/logs $ stat test.log
   File: 'test.log'
   Size: 2 Blocks: 1  IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511629882  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (  500/core)   Gid: (  500/core)
 Access: 2015-01-12 20:05:03.0 +
 Modify: 2015-01-12 20:06:09.637234229 +
 Change: 2015-01-12 20:06:09.0 +
  Birth: -


 Change time is not updated, making some tail libs not show new
 content until you force the change time to be updated, for example by
 running a touch on the file.
 Some tools freeze and trigger other issues in the system.


 Tests, all in the machine #2:

 FAILED - https://github.com/ActiveState/tail
 FAILED - /usr/bin/tail of a Google docker image running debian wheezy
 PASSED - /usr/bin/tail of a ubuntu 14.04 docker image
 PASSED - /usr/bin/tail of the coreos release 494.5.0


 Tests on machine #1 (the same machine that is writing the file): all tests pass.



 On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote:
 What versions of all the Ceph pieces are you using? (Kernel
 client/ceph-fuse, MDS, etc)

 Can you provide more details on exactly what the program is doing on
 which nodes?
 -Greg

 On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote:
 first 3 stat commands shows blocks and size changing, but not the times
 after a touch it changes and tail works

 I saw some cephfs freezes related to it, it came back after touching the 
 files

 coreos2 logs # stat deis-router.log
   File: 'deis-router.log'
   Size: 148564 Blocks: 291        IO Block: 4194304 regular file
 Device: 0h/0d Inode: 1099511628780  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2015-01-10 01:13:00.100582619 

[ceph-users] Ceph erasure-coded pool

2015-01-12 Thread Don Doerner
All,
I wish to experiment with erasure-coded pools in Ceph.  I've got some questions:

1.  Is FIREFLY a reasonable release to be using to try EC pools?  When I 
look at various bits of development info, it appears that the work is complete 
in FIREFLY, but I thought I'd ask :)

2.  It looks, in FIREFLY, as if not all I/O operations can be performed on 
EC pools.  I am trying to work with RBD clients, and I've run into some 
conflicting information... can RBDs run on EC pools directly, or is a caching 
tier required?

a.  Assuming a cache tier is required, where might I read information on 
sizing the cache tier?

b.  Looking through the issues, it appears there are some race conditions 
(e.g., #9285, http://tracker.ceph.com/issues/9285) for cache tiers in FIREFLY.  
Should I avoid cache tiers at this level?  At what level, if any, are these 
addressed (I don't see commits in #9285, for example)?

3.  When configuring EC pools, to specify the number of PGs, can I 
reasonably assume that I should use (K+M) instead of a replica count?  So, for 
example, if I have 24 OSDs, and my EC profile has K=8 and M=4, then I should 
specify 200 (i.e., (24*100)/12) placement groups?

4.  As I add OSDs, can I adjust the number of PGs?
Thanks in advance...
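
In case a concrete starting point helps, a minimal firefly-style sketch of the
layout being asked about: an 8+4 EC base pool with a replicated cache tier in
front for RBD. The pool names, PG counts and cache mode are illustrative
assumptions, not sizing advice:

  ceph osd erasure-code-profile set ec84 k=8 m=4 ruleset-failure-domain=host
  ceph osd pool create ecpool 200 200 erasure ec84
  ceph osd pool create cachepool 200 200
  ceph osd tier add ecpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool
  rbd create --pool ecpool --size 102400 testimage    # RBD I/O is serviced through the cache tier
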
___

Don Doerner
Quantum Corporation



Re: [ceph-users] ceph on peta scale

2015-01-12 Thread Robert van Leeuwen
 however, for geographically distributed datacentres, especially when the
 network fluctuates, how do we handle that? From what I have read, it seems
 Ceph needs a big network pipe

Ceph isn't really suited for WAN-style distribution. Some users have
high-enough and consistent-enough bandwidth (with low enough latency)
to do it, but otherwise you probably want to use Ceph within the data
centers and layer something else on top of it.

Indeed.
Ceph is not aware of WAN links.
So reads and writes will be done remotely even if there is a copy locally.
Bandwidth might not be much of an issue but latency certainly will be.
Although bandwidth during a rebalance of data might also be problematic...

Cheers,
Robert van Leeuwen


Re: [ceph-users] SSD Journal Best Practice

2015-01-12 Thread lidc...@redhat.com
For the first choice:
 ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd
I find ceph-deploy will create the partitions automatically, and each partition is 5G 
by default. So the first choice and the second choice are almost the same.
Compared to a filesystem, I prefer a block device to get better performance. 
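
If you want larger (or explicitly laid out) journal partitions instead of the
5G that "osd journal size = 5120" gives you by default, a rough sketch, assuming
a hypothetical /dev/sdd SSD and 20 GiB journals; alternatively you can simply
raise "osd journal size" in ceph.conf before running ceph-deploy:

  parted -s /dev/sdd mklabel gpt
  parted -s /dev/sdd mkpart journal-sdb 1MiB 20GiB
  parted -s /dev/sdd mkpart journal-sdc 20GiB 40GiB
  ceph-deploy osd create ceph-node:sdb:/dev/sdd1 ceph-node:sdc:/dev/sdd2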


 
From: lidc...@redhat.com
Date: 2015-01-12 12:35
To: ceph-us...@ceph.com
Subject: SSD Journal Best Practice
Hi everyone:
  I plan to use an SSD journal to improve performance.
  I have one 1.2T SSD disk per server.

  What is the best practice for the SSD journal?
  There are three choices for deploying the SSD journal:
  1. all OSDs use the same SSD partition
  ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd
  2. each OSD uses its own SSD partition
  ceph-deploy osd create ceph-node:sdb:/dev/ssd1 ceph-node:sdc:/dev/ssd2
  3. each OSD uses a file for the journal, with the file on the SSD disk
  ceph-deploy osd create ceph-node:sdb:/mnt/ssd/ssd1 ceph-node:sdc:/mnt/ssd/ssd2

  Any suggestions?
   Thanks.



Re: [ceph-users] Replace corrupt journal

2015-01-12 Thread Sage Weil
On Sun, 11 Jan 2015, Sahlstrom, Claes wrote:
 
 Hi,
 
  
 
 I have a problem starting a couple of OSDs because of the journal being
 corrupt. Is there any way to replace the journal and keeping the rest of the
 OSD intact.

It is risky at best... I would not recommend it!  The safe route is to 
wipe the OSD and let the cluster repair.

     -1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read
 past sequence 8188178 but header indicates the journal has committed up
 through 8188206, journal is corrupt
 
  0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In
 function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, bool*)'
 thread 7fb32df86900 time 2015-01-11 16:02:54.475276
 
 os/FileJournal.cc: 1693: FAILED assert(0)

Do you mind making a note that you saw this on this ticket:

http://tracker.ceph.com/issues/6003

We see it periodically in QA but have never been able to track it down.  
It could also be caused by a hardware issue, so any information about 
whether the journal device appears damaged would be helpful.

Thanks!
sage


Re: [ceph-users] Replace corrupt journal

2015-01-12 Thread Sahlstrom, Claes
Thanks for the reply, I have had some more time to mess around with this 
now.

I understand that the best thing is to allow it to rebuild the entire OSD, but 
as I am currently only using one replica and 2 of 3 machines had problems, I ended up 
in a bad situation. With OSDs down on 2 machines and one replica I think I 
would lose data for certain if I rebuilt them from scratch. Luckily in my case 
there was no new data being written to the cluster at that time; I only use it 
as a NAS in my home-lab.

It did work out fine for me this time, but I guess anyone reading this should 
know it is not a recommended way to do things. I got confused because I was 
reusing a logical volume as a journal and I didn't wipe it properly before I 
used --mkjournal; wiping it properly and then using --mkjournal seems to 
have solved the problem for me.
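
For anyone who lands on this thread later, a rough sketch of the
recreate-the-journal sequence being described (the osd id, journal device and
init commands are placeholders; anything that was only in the journal is lost,
so scrub/repair afterwards and treat it as a last resort, as Sage says):

  # service ceph stop osd.5                                    # or: stop ceph-osd id=5, depending on init system
  # dd if=/dev/zero of=/dev/vg0/journal-osd5 bs=1M count=100   # wipe the stale journal header
  # ceph-osd -i 5 --mkjournal                                  # write a fresh journal for osd.5
  # service ceph start osd.5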

My only outstanding issue now is one pg that remains inconsistent even after 
trying to do a repair; besides that everything seems to be fine. I haven't 
dug too much into that yet; with only one replica I guess it is tricky to 
tell which of the replicas is the broken one.

I will add a note to that ticket, it happened when the power to the server was 
lost while replicating and I think that is what made two journals corrupt.
 
Cheers,
Claes



-Original Message-
From: Sage Weil [mailto:s...@newdream.net] 
Sent: den 12 januari 2015 15:46
To: Sahlstrom, Claes
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Replace corrupt journal

On Sun, 11 Jan 2015, Sahlstrom, Claes wrote:
 
 Hi,
 
  
 
 I have a problem starting a couple of OSDs because of the journal 
 being corrupt. Is there any way to replace the journal and keeping the 
 rest of the OSD intact.

It is risky at best... I would not recommend it!  The safe route is to wipe the 
OSD and let the cluster repair.

     -1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to 
 read past sequence 8188178 but header indicates the journal has 
 committed up through 8188206, journal is corrupt
 
  0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: 
 In function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, 
 bool*)'
 thread 7fb32df86900 time 2015-01-11 16:02:54.475276
 
 os/FileJournal.cc: 1693: FAILED assert(0)

Do you mind making a note that you saw this on this ticket:

http://tracker.ceph.com/issues/6003

We see it periodically in QA but have never been able to track it down.  
It could also be caused by a hardware issue, so any information about whether 
the journal device appears damaged would be helpful.

Thanks!
sage


[ceph-users] Caching

2015-01-12 Thread Samuel Terburg - Panther-IT BV

I have a couple of questions about caching:

I have 5 VM-Hosts serving 20 VMs.
I have 1 Ceph pool where the VM-Disks of those 20 VMs reside as RBD Images.

1) Can i use multiple caching-tiers on the same data pool?
   I would like to use a local SSD OSD on each VM-Host that can serve 
as application accelerator local-cache for the VM-Disks.
   I can imagine data corruption if other VM-Hosts write to the same 
Ceph data pool but not using the same caching-tier.
   I imagine no data corruption if I know no other VM-Hosts will access 
that Ceph object (VM-Disk / RBD image).
   I would need to flush the cache of that VM-Host when i shutdown the 
VM on it, before i can start the VM on a different VM-Host.
   Or is Ceph perhaps smart enough that it would notify the above 
Caching-Tier to evict a cached object when there is a change on that 
object not changed by that caching-tier?


2) RBD Cache is useless for hosting Oracle databases?
   If Oracle is doing O_SYNC and RBD Cache would flush on O_SYNC, 
then there would be nothing cached. Correct?


3) Would a caching tier be smart enough to flush dirty/modified objects 
on idle i/o?

   (when client i/o is not busy ceph will use that time to sync to backend)
   I know it will flush at a certain capacity (50%) or at a certain 
age (600sec), but can it also flush at a certain busy/idle percentage or 
auto-magically/intelligently?
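
Regarding 2): librbd's client-side cache is configured per client in the [client]
section of ceph.conf on the VM hosts (it does not apply to kernel rbd mappings).
A sketch with illustrative values; whether it helps an O_SYNC-heavy workload is
exactly the open question above:

  [client]
      rbd cache = true
      rbd cache size = 67108864                  # 64 MB per client
      rbd cache max dirty = 50331648             # bytes allowed dirty before writeback
      rbd cache writethrough until flush = true  # stay writethrough until the guest flushes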



Thanks,


Samuel Terburg
Panther-IT BV
www.panther-it.nl


