[ceph-users] Replace corrupt journal
Hi,

I have a problem starting a couple of OSDs because their journals are corrupt. Is there any way to replace the journal while keeping the rest of the OSD intact?

-1> 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt
 0> 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7fb32df86900 time 2015-01-11 16:02:54.475276
os/FileJournal.cc: 1693: FAILED assert(0)

I ended up in this situation when osd.9 on host orange went down, and then I had a power failure on the host purple which left two of my journals corrupt.

-3   6   host purple
 4   1       osd.4   up     1
 5   1       osd.5   down   0
 7   2       osd.7   down   0
 6   2       osd.6   up     1
-4   6   host orange
 8   1       osd.8   up     1
 9   1       osd.9   down   0

The filesystem was not in use by users, but it was replicating when the host went down. I figure I still have the data on the OSD disks; they are still mountable and the XFS filesystem on them seems to be intact.

Thanks,
Claes

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
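For context, the commonly suggested way out of this is to give the OSD a fresh, empty journal and accept that the writes stuck in the corrupt one are discarded -- only reasonable if replication can repair the OSD afterwards. A sketch, assuming osd.5 and a journal at the default FileStore path (all paths here are illustrative, not taken from the message):

```
# Stop the OSD if it is still trying to start
service ceph stop osd.5

# Move the corrupt journal aside (path is illustrative)
mv /var/lib/ceph/osd/ceph-5/journal /var/lib/ceph/osd/ceph-5/journal.corrupt

# Create a fresh, empty journal for this OSD and start it again
ceph-osd -i 5 --mkjournal
service ceph start osd.5

# Then let scrub/repair reconcile any writes lost with the old journal
ceph pg repair <pgid>    # per inconsistent PG, as needed
```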
[ceph-users] Ceph MeetUp Berlin
Hi,

the next MeetUp in Berlin takes place on January 26 at 18:00 CET. Our host is Deutsche Telekom; they will give a short presentation about their OpenStack / Ceph based production system.

Please RSVP at http://www.meetup.com/Ceph-Berlin/events/218939774/

Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin
Re: [ceph-users] NUMA and ceph ... zone_reclaim_mode
(resending to list)

Hi Kyle,

I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that is used for Cinder RBD volumes and doesn't suffer from the freezing OSD problem:

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1,10,1000,1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.:

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

  zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all on the ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO:

  On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default.
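The change described above boils down to a one-line sysctl. A minimal sketch (the sysctl name is standard Linux; whether your distro persists settings via /etc/sysctl.conf or a file under /etc/sysctl.d/, and the file name used here, are assumptions):

```
# Check the current setting (non-zero means NUMA zone reclaim is enabled)
cat /proc/sys/vm/zone_reclaim_mode

# Disable it on the running system
sysctl -w vm.zone_reclaim_mode=0

# Persist across reboots (path and file name may vary by distro)
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.d/99-ceph-numa.conf
```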
Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote:

It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running out of memory due to uneven allocation, it is possible for the zone to enter reclaim mode when threads/processes are scheduled on a core in that zone and those processes request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy the memory allocation for those processes, it needs to page out some of the contents of the contentious zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues: set the vm.zone_reclaim_mode sysctl to 0, along with prefixing ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in /etc/default/ceph and modifying the ceph-osd.conf upstart script, along with adding a depend to the
[ceph-users] error adding OSD to crushmap
Hi all,

I've been trying to add a few new OSDs, and as I manage everything with puppet, I was adding them manually via the CLI. At one point this adds the OSD to the crush map using:

# ceph osd crush add 6 0.0 root=default

but I get:

Error ENOENT: osd.6 does not exist. create it before updating the crush map

If I read correctly, this should be the right command to add the OSD to the crush map... is this a bug? I'm running the latest firefly, 0.80.7.

thanks

PS: For now I just edited the crushmap directly, but it would make it a lot easier to do it via the CLI commands...
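For reference, the error message is literal: `ceph osd crush add` only places an OSD id that the cluster already knows about, so the id has to be registered first. A sketch of the usual manual order (the id and weight are taken from the message above; whether your provisioning already ran the create step is the open question):

```
# Allocate/register the next osd id with the cluster first;
# the command prints the id it assigned (expected to be 6 here)
ceph osd create

# Only then can that osd be placed in the CRUSH map
ceph osd crush add 6 0.0 root=default
```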
[ceph-users] NUMA zone_reclaim_mode
(apologies if you receive this more than once... apparently I cannot reply to a 1-year-old message on the list)

Dear all,

I'd like to +10 this old proposal of Kyle's. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that is used for Cinder RBD volumes and doesn't suffer from the freezing OSD problem:

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1,10,1000,1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.:

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

  zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all on the ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO:

  On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default.

Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote:

It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running out of memory due to uneven allocation, it is possible for the zone to enter reclaim mode when threads/processes are scheduled on a core in that zone and those processes request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy the memory allocation for those processes, it needs to page out some of the contents of the contentious zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues: set the vm.zone_reclaim_mode sysctl to 0, along with prefixing ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, 12 Jan 2015, Dan Van Der Ster wrote:

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean, since the recommendation probably doesn't apply when using a backend that doesn't use a file system...

Thanks!
sage
Re: [ceph-users] cephfs modification time
What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc.) Can you provide more details on exactly what the program is doing on which nodes?
-Greg

On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote:

The first 3 stat commands show blocks and size changing, but not the times. After a touch it changes and tail works. I saw some cephfs freezes related to it; it came back after touching the files.

coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 148564    Blocks: 291    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:00.100582619 +0000
Modify: 2015-01-10 01:13:00.100582619 +0000
Change: 2015-01-10 01:13:00.0 +0000
 Birth: -

coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 152633    Blocks: 299    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:00.100582619 +0000
Modify: 2015-01-10 01:13:00.100582619 +0000
Change: 2015-01-10 01:13:00.0 +0000
 Birth: -

coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 155763    Blocks: 305    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:00.100582619 +0000
Modify: 2015-01-10 01:13:00.100582619 +0000
Change: 2015-01-10 01:13:00.0 +0000
 Birth: -

coreos2 logs # touch deis-router.log
coreos2 logs # stat deis-router.log
  File: 'deis-router.log'
  Size: 155763    Blocks: 305    IO Block: 4194304   regular file
Device: 0h/0d   Inode: 1099511628780   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)
Access: 2015-01-10 01:13:46.961858103 +0000
Modify: 2015-01-10 01:13:46.961858103 +0000
Change: 2015-01-10 01:13:46.0 +0000
 Birth: -

On Fri, Jan 9, 2015 at 11:11 PM, Lorieri lori...@gmail.com wrote:

Hi, I have a program that tails a file, and this file is created on another machine. Some tail programs do not work because the modification time is not updated on the remote machines. I've found this old thread:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/11001

It mentions the problem and suggests an ntp sync. I tried to re-sync ntp and restart the ceph cluster, but the issue persists. Do you know if it is possible to avoid this behavior?

thanks
-lorieri
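The dependency being described is easy to see: polling tail implementations re-read a file only when stat() reports growth, which is exactly what breaks when a remote cephfs client sees stale size/mtime. A minimal sketch of that polling logic, on a local filesystem purely for illustration:

```python
import os
import tempfile

def poll_for_growth(path, last_size):
    """Return (new_size, new_bytes). The file is re-read only when stat()
    shows growth -- so a client that sees stale stat() data emits nothing,
    which is the tail behavior described in the message above."""
    st = os.stat(path)
    if st.st_size <= last_size:
        return last_size, b""
    with open(path, "rb") as f:
        f.seek(last_size)
        return st.st_size, f.read()

# demo: append to a file and poll it twice
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line1\n")
    name = tmp.name

size, new = poll_for_growth(name, 0)
print(new)  # b'line1\n'

with open(name, "ab") as f:
    f.write(b"line2\n")

size, new = poll_for_growth(name, size)
print(new)  # b'line2\n'
os.remove(name)
```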
[ceph-users] the performance issue for cache pool
Hi everyone,

I used writeback mode for a cache pool:

ceph osd tier add sas ssd
ceph osd tier cache-mode ssd writeback
ceph osd tier set-overlay sas ssd

and I also set the dirty ratio and full ratio:

ceph osd pool set ssd cache_target_dirty_ratio .4
ceph osd pool set ssd cache_target_full_ratio .8

The capacity of the ssd cache pool is 4T. I used fio to test performance:

fio -filename=/dev/rbd0 -direct=1 -iodepth 32 -thread -rw=randwrite -ioengine=libaio -bs=16M -size=2000G -group_reporting -name=mytest

At the beginning the performance is very good, but after half an hour, when the hot cache pool begins flushing dirty objects, the rados performance becomes unstable, swinging from 87851 kB/s to 860 MB/s. Are there any tuning parameters to get more stable performance? Thanks.

2014-12-23 22:46:24.844730 mon.0 [INF] pgmap v24101: 6144 pgs: 6144 active+clean; 1246 GB data, 4012 GB used, 45109 GB / 49121 GB avail; 680 MB/s wr, 1007 op/s
2014-12-23 22:46:27.851431 mon.0 [INF] pgmap v24102: 6144 pgs: 6144 active+clean; 1246 GB data, 4012 GB used, 45109 GB / 49121 GB avail; 161 MB/s wr, 299 op/s
2014-12-23 22:46:28.883866 mon.0 [INF] pgmap v24103: 6144 pgs: 6144 active+clean; 1247 GB data, 4015 GB used, 45106 GB / 49121 GB avail; 308 MB/s wr, 1065 op/s
2014-12-23 22:46:29.885914 mon.0 [INF] pgmap v24104: 6144 pgs: 6144 active+clean; 1247 GB data, 4016 GB used, 45105 GB / 49121 GB avail; 701 MB/s wr, 1621 op/s
2014-12-23 22:46:32.842955 mon.0 [INF] pgmap v24105: 6144 pgs: 6144 active+clean; 1247 GB data, 4016 GB used, 45105 GB / 49121 GB avail; 116 MB/s wr, 160 op/s
2014-12-23 22:46:33.863964 mon.0 [INF] pgmap v24106: 6144 pgs: 6144 active+clean; 1248 GB data, 4021 GB used, 45100 GB / 49121 GB avail; 344 MB/s wr, 923 op/s
2014-12-23 22:46:34.861011 mon.0 [INF] pgmap v24107: 6144 pgs: 6144 active+clean; 1248 GB data, 4021 GB used, 45100 GB / 49121 GB avail; 706 MB/s wr, 1564 op/s
2014-12-23 22:46:38.176885 mon.0 [INF] pgmap v24108: 6144 pgs: 6144 active+clean; 1249 GB data, 4024 GB used, 45097 GB / 49121 GB avail; 222 MB/s wr, 938 op/s
2014-12-23 22:46:39.177233 mon.0 [INF] pgmap v24109: 6144 pgs: 6144 active+clean; 1250 GB data, 4026 GB used, 45095 GB / 49121 GB avail; 427 MB/s wr, 1292 op/s
2014-12-23 22:46:42.842279 mon.0 [INF] pgmap v24110: 6144 pgs: 6144 active+clean; 1250 GB data, 4026 GB used, 45095 GB / 49121 GB avail; 320 MB/s wr, 570 op/s
2014-12-23 22:46:43.872017 mon.0 [INF] pgmap v24111: 6144 pgs: 6144 active+clean; 1251 GB data, 4030 GB used, 45090 GB / 49121 GB avail; 405 MB/s wr, 992 op/s
2014-12-23 22:46:44.862873 mon.0 [INF] pgmap v24112: 6144 pgs: 6144 active+clean; 1251 GB data, 4030 GB used, 45090 GB / 49121 GB avail; 729 MB/s wr, 1755 op/s
2014-12-23 22:46:47.847813 mon.0 [INF] pgmap v24113: 6144 pgs: 6144 active+clean; 1251 GB data, 4031 GB used, 45090 GB / 49121 GB avail; 2053 kB/s wr, 135 op/s
2014-12-23 22:46:48.857285 mon.0 [INF] pgmap v24114: 6144 pgs: 6144 active+clean; 1252 GB data, 4033 GB used, 45087 GB / 49121 GB avail; 272 MB/s wr, 433 op/s
2014-12-23 22:46:49.871775 mon.0 [INF] pgmap v24115: 6144 pgs: 6144 active+clean; 1252 GB data, 4034 GB used, 45087 GB / 49121 GB avail; 535 MB/s wr, 586 op/s
2014-12-23 22:46:52.842098 mon.0 [INF] pgmap v24116: 6144 pgs: 6144 active+clean; 1252 GB data, 4033 GB used, 45088 GB / 49121 GB avail; 3074 kB/s wr, 113 op/s
2014-12-23 22:46:53.845398 mon.0 [INF] pgmap v24117: 6144 pgs: 6144 active+clean; 1254 GB data, 4037 GB used, 45084 GB / 49121 GB avail; 342 MB/s wr, 571 op/s
2014-12-23 22:46:57.844137 mon.0 [INF] pgmap v24118: 6144 pgs: 6144 active+clean; 1254 GB data, 4037 GB used, 45084 GB / 49121 GB avail; 302 MB/s wr, 577 op/s
2014-12-23 22:46:58.848028 mon.0 [INF] pgmap v24119: 6144 pgs: 6144 active+clean; 1255 GB data, 4039 GB used, 45082 GB / 49121 GB avail; 319 MB/s wr, 897 op/s
2014-12-23 22:47:02.844724 mon.0 [INF] pgmap v24120: 6144 pgs: 6144 active+clean; 1255 GB data, 4039 GB used, 45082 GB / 49121 GB avail; 327 MB/s wr, 856 op/s
2014-12-23 22:47:03.850795 mon.0 [INF] pgmap v24121: 6144 pgs: 6144 active+clean; 1256 GB data, 4043 GB used, 45078 GB / 49121 GB avail; 297 MB/s wr, 887 op/s
2014-12-23 22:47:08.169046 mon.0 [INF] pgmap v24122: 6144 pgs: 6144 active+clean; 1256 GB data, 4045 GB used, 45076 GB / 49121 GB avail; 318 MB/s wr, 830 op/s
2014-12-23 22:47:09.169302 mon.0 [INF] pgmap v24123: 6144 pgs: 6144 active+clean; 1257 GB data, 4046 GB used, 45075 GB / 49121 GB avail; 133 MB/s wr, 257 op/s
2014-12-23 22:47:12.844073 mon.0 [INF] pgmap v24124: 6144 pgs: 6144 active+clean; 1257 GB data, 4046 GB used, 45075 GB / 49121 GB avail; 65702 kB/s wr, 124 op/s
2014-12-23 22:47:13.845286 mon.0 [INF] pgmap v24125: 6144 pgs: 6144 active+clean; 1257 GB data, 4047 GB used, 45074 GB / 49121 GB avail; 142 MB/s wr, 284 op/s
2014-12-23 22:47:14.846753 mon.0 [INF] pgmap v24126: 6144 pgs: 6144 active+clean; 1257 GB data, 4047 GB used, 45074 GB / 49121 GB avail; 461
[ceph-users] unsubscribe
unsubscribe

Regards,
-don-
[ceph-users] Problem with Rados gateway
Scenario: OpenStack Juno RDO on CentOS 7. Ceph version: Giant.

On CentOS 7 the old fastcgi is no longer available, but there is mod_fcgid. The Apache VH is the following:

<VirtualHost *:8080>
    ServerName rdo-ctrl01
    DocumentRoot /var/www/radosgw
    RewriteEngine On
    RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
    <Directory /var/www/radosgw>
        Options +ExecCGI
        AllowOverride All
        SetHandler fcgid-script
        Order allow,deny
        Allow from all
        AuthBasicAuthoritative Off
    </Directory>
    AllowEncodedSlashes On
    ErrorLog /var/log/httpd/error.log
    CustomLog /var/log/httpd/access.log combined
    ServerSignature Off
</VirtualHost>

In /var/www/radosgw there is the CGI file s3gw.fcgi:

#!/bin/sh
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway -d --debug-rgw 20 --debug-ms 1

For the configuration I've followed this documentation: http://docs.ceph.com/docs/next/radosgw/config/

When I try to access the object storage I get the following errors:

1) Apache VH error:

[Wed Jan 07 13:15:22.029411 2015] [fcgid:info] [pid 2051] mod_fcgid: server rdo-ctrl01:/var/www/radosgw/s3gw.fcgi(28527) started
2015-01-07 13:15:22.046644 7ff16e240880 0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process radosgw, pid 28527
2015-01-07 13:15:22.053673 7ff16e240880 1 -- :/0 messenger.start
2015-01-07 13:15:22.054783 7ff16e240880 1 -- :/1028527 -- 163.162.90.120:6789/0 -- auth(proto 0 40 bytes epoch 0) v1 -- ?+0 0x11d9100 con 0x11a0870
2015-01-07 13:15:22.055339 7ff16e238700 1 -- 163.162.90.120:0/1028527 learned my addr 163.162.90.120:0/1028527
2015-01-07 13:15:22.056425 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 1 mon_map magic: 0 v1 200+0+0 (3839442293 0 0) 0x7ff148000ab0 con 0x11a0870
2015-01-07 13:15:22.056547 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 2 auth_reply(proto 2 0 (0) Success) v1 33+0+0 (3991100068 0 0) 0x7ff148000f70 con 0x11a0870
2015-01-07 13:15:22.056900 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7ff14c0012e0 con 0x11a0870
2015-01-07 13:15:22.057505 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 3 auth_reply(proto 2 0 (0) Success) v1 222+0+0 (1145796146 0 0) 0x7ff148000f70 con 0x11a0870
2015-01-07 13:15:22.057768 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- ?+0 0x7ff14c001ca0 con 0x11a0870
2015-01-07 13:15:22.058496 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 4 auth_reply(proto 2 0 (0) Success) v1 425+0+0 (2903986998 0 0) 0x7ff148001200 con 0x11a0870
2015-01-07 13:15:22.058694 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x11d94c0 con 0x11a0870
2015-01-07 13:15:22.058843 7ff16e240880 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x11d91d0 con 0x11a0870
2015-01-07 13:15:22.058934 7ff16e240880 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x11d9ab0 con 0x11a0870
2015-01-07 13:15:22.059214 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 5 mon_map magic: 0 v1 200+0+0 (3839442293 0 0) 0x7ff148001130 con 0x11a0870
2015-01-07 13:15:22.059140 7ff1567fc700 2 RGWDataChangesLog::ChangesRenewThread: start
2015-01-07 13:15:22.059737 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 6 mon_subscribe_ack(300s) v1 20+0+0 (1877860257 0 0) 0x7ff148001410 con 0x11a0870
2015-01-07 13:15:22.059869 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 7 osd_map(52..52 src has 1..52) v3 5987+0+0 (3066791464 0 0) 0x7ff148002d50 con 0x11a0870
2015-01-07 13:15:22.060250 7ff16e240880 20 get_obj_state: rctx=0x119c2f0 obj=.rgw.root:default.region state=0x119dba8 s->prefetch_data=0
2015-01-07 13:15:22.060302 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 8 mon_subscribe_ack(300s) v1 20+0+0 (1877860257 0 0) 0x7ff148001130 con 0x11a0870
2015-01-07 13:15:22.060325 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 9 osd_map(52..52 src has 1..52) v3 5987+0+0 (3066791464 0 0) 0x7ff1480046f0 con 0x11a0870
2015-01-07 13:15:22.060333 7ff16e240880 10 cache get: name=.rgw.root+default.region : miss
2015-01-07 13:15:22.060342 7ff15e7fc700 1 -- 163.162.90.120:0/1028527 == mon.0 163.162.90.120:6789/0 10 mon_subscribe_ack(300s) v1 20+0+0 (1877860257 0 0) 0x7ff148004bb0 con 0x11a0870
2015-01-07 13:15:22.060444 7ff16e240880 1 -- 163.162.90.120:0/1028527 -- 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=53}) v2 -- ?+0 0x119eaf0 con 0x11a0870
2015-01-07 13:15:22.060805
Re: [ceph-users] rbd directory listing performance issues
Hi,

I am just wondering if anyone has any thoughts on the questions below... I would like to order some additional hardware ASAP, and the order that I place may change depending on the feedback that I receive.

Thanks again,
Shain

Sent from my iPhone

On Jan 9, 2015, at 2:45 PM, Shain Miley smi...@npr.org wrote:

Although it seems like having a regularly scheduled cron job to do a recursive directory listing may be OK for us as a bit of a workaround, I am still in the process of trying to improve performance. A few other questions have come up as a result.

a) I am in the process of looking at specs for a new rbd 'headnode' that will be used to mount our 100TB rbd image. At some point in the future we may look into the performance and multi-client access that cephfs could offer... is there any reason that I would not be able to use this new server as both an rbd client and an MDS server (assuming the hardware is good enough)? I know that some cluster functions should not and cannot be mixed on the same server... is this by any chance one of them?

b) Currently the 100TB rbd image is acting as one large repository for our archive, and this will only grow over time. I understand that ceph is pool based; however, I am wondering if I would see better per-image performance if, for example, instead of having 1 x 100TB rbd image I had 4 x 25TB rbd images (since we really could split these up based on our internal groups).

c) Would adding a few ssd drives (in the right quantity) to each node help out with reads as well as writes?

d) I am a bit confused about how to enable the rbd cache option on the client... is this a change that only needs to be made to the ceph.conf file on the rbd kernel client server, or do the mds and osd servers need their ceph.conf files modified as well and their services restarted?

Other options that I might be looking into going forward are moving some of this data (the data actually needed by our php apps) to rgw, although that option adds some more complexity and unfamiliarity for our users.

Thanks again for all the help so far.

Shain

On 01/07/2015 03:40 PM, Shain Miley wrote:

Just to follow up on this thread: the main reason that the rbd directory listing latency was an issue for us was that we were seeing a large amount of IO delay in a PHP app that reads from that rbd image. It occurred to me (based on Robert's cache_dir suggestion below) that maybe doing a recursive find or a recursive directory listing inside the one folder in question might speed things up. After doing the recursive find, the directory listing seems much faster and the responsiveness of the PHP app has increased as well. Hopefully nothing else will need to be done here; however, worst case, a daily or weekly cron job that traverses the directory tree in that folder might be all we need.

Thanks again for all the help.

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Shain Miley [smi...@npr.org]
Sent: Tuesday, January 06, 2015 8:16 PM
To: Christian Balzer; ceph-us...@ceph.com
Subject: Re: [ceph-users] rbd directory listing performance issues

Christian,

Each of the OSD server nodes is running on Dell R-720xds with 64 GB of RAM. We have 107 OSDs, so I have not checked all of them; however, the ones I have checked with xfs_db have shown anywhere from 1% to 4% fragmentation.

I'll try to upgrade the client server to 32 or 64 GB of RAM at some point soon; however, at this point all the tuning that I have done has not yielded all that much in terms of results. It may be a simple fact that I need to look into adding some SSDs, and that the overall bottleneck here is the 4TB 7200 rpm disks we are using.

In general, when looking at the graphs in Calamari, we see around 20ms latency (await) for our OSDs; however, there are lots of times when we see (via the graphs) spikes of 250ms to 400ms as well.

Thanks again,
Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: Christian Balzer [ch...@gol.com]
Sent: Tuesday, January 06, 2015 7:34 PM
To: ceph-us...@ceph.com
Cc: Shain Miley
Subject: Re: [ceph-users] rbd directory listing performance issues

Hello,

On Tue, 6 Jan 2015 15:29:50 +0000 Shain Miley wrote:

Hello, we currently have a 12 node (3 monitor + 9 OSD) ceph cluster, made up of 107 x 4TB drives formatted with xfs. The cluster is running ceph version 0.80.7.

I assume journals are on the same HDD then. How much memory per node?

[snip]

A while back I created an 80 TB rbd image to be used as an archive repository for some of our audio and video
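On the rbd cache question above: rbd cache is a client-side librbd setting, so it belongs in the [client] section of ceph.conf on the machine opening the image; the mds/osd ceph.conf files are not involved. Note also that the kernel rbd client does not use librbd's cache at all (krbd goes through the page cache), so this setting only matters for librbd consumers such as qemu. A sketch, where the option names are the standard ones and the sizes are illustrative assumptions:

```
[client]
    rbd cache = true
    rbd cache size = 67108864                  # 64 MB, illustrative
    rbd cache max dirty = 50331648             # must be smaller than rbd cache size
    rbd cache writethrough until flush = true  # safe default until the guest flushes
```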
Re: [ceph-users] ceph on peta scale
On Mon, Jan 12, 2015 at 3:55 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:

Thanks Greg. No, I am more into a large-scale RADOS system, not the filesystem. However, for geographically distributed datacentres, especially when the network fluctuates, how do we handle that? From what I've read, it seems Ceph needs a big network pipe.

Ceph isn't really suited for WAN-style distribution. Some users have high-enough and consistent-enough bandwidth (with low enough latency) to do it, but otherwise you probably want to use Ceph within the data centers and layer something else on top of it.
-Greg

/Zee

On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote:

On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote: I just finished configuring ceph up to 100 TB with openstack... Since we are also using Lustre on our HPC machines, I was just wondering what the bottleneck is for ceph at peta scale, like Lustre. Any ideas? Or has someone tried it?

If you're talking about people building a petabyte Ceph system, there are *many* who run clusters of that size. If you're talking about the Ceph filesystem as a replacement for Lustre at that scale, the concern is less about the raw amount of data and more about the resiliency of the current code base at that size... but if you want to try it out and tell us what problems you run into, we will love you forever. ;)

(The scalable file system use case is what actually spawned the Ceph project, so in theory there shouldn't be any serious scaling bottlenecks. In practice it will depend on what kind of metadata throughput you need, because the multi-MDS stuff is improving but still less stable.)
-Greg

--
Regards
Zeeshan Ali Shah
System Administrator - PDC HPC
PhD researcher (IT security)
Kungliga Tekniska Hogskolan
+46 8 790 9115
http://www.pdc.kth.se/members/zashah
[ceph-users] reset osd perf counters
Is there a way to 'reset' the osd perf counters? The numbers for osd 73 through osd 83 look really high compared to the rest of the numbers I see here. I was wondering if I could clear the counters out, so that I have a fresh set of data to work with.

root@cephmount1:/var/log/samba# ceph osd perf
osdid  fs_commit_latency(ms)  fs_apply_latency(ms)
  0    0    45
  1    0    14
  2    0    47
  3    0    25
  4    1    44
  5    1     2
  6    1     2
  7    0    39
  8    0    32
  9    0    34
 10    2   186
 11    0    68
 12    1     1
 13    0    34
 14    0     1
 15    2    37
 16    0    23
 17    0    28
 18    0    26
 19    0    22
 20    0     2
 21    2    24
 22    0    33
 23    0     1
 24    3    98
 25    2    70
 26    0     1
 27    3    99
 28    0     2
 29    2   101
 30    2    72
 31    2    81
 32    3   112
 33    3    94
 34    4   152
 35    0    56
 36    0     2
 37    2    58
 38    0     1
 39    0     3
 40    0     2
 41    0     2
 42    1     1
 43    0     2
 44    1    44
 45    0     2
 46    0     1
 47    3    85
 48    0     1
 49    2    75
 50    4   398
 51    3   115
 52    0     1
 53    2    47
 54    6   290
 55    5   153
 56    7   453
 57    2    66
 58    1     1
 59    5   196
 60    0     0
 61    0    93
 62    0     9
 63    0     1
 64    0     1
 65    0     4
 66    0     1
 67    0    18
 68    0    16
 69    0    81
 70    0    70
 71    0     0
 72    0     1
 73   74  1217
 74    0     1
 75   64  1238
 76   92  1248
 77    0     1
 78    0     1
 79  109  1333
 80   68  1451
 81   66  1192
 82   95  1215
 83   81  1331
 84    3    56
 85    3    65
 86    0     1
 87    3    55
 88    4    42
 89    3    59
 90    4    52
 91    2    34
 92    0    17
 93    0     1
 94    0
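Independent of resetting the counters, a quick mechanical way to flag the outliers in output like this — a minimal sketch, not any existing tool; the rows are assumed to already be parsed into (osd, commit_ms, apply_ms) tuples:

```python
# Flag OSDs whose fs_apply_latency is far above the cluster median.
from statistics import median

def outliers(rows, factor=10, floor_ms=100):
    """Return OSD ids whose apply latency exceeds max(floor_ms, factor * median)."""
    med = median(apply_ms for _osd, _commit, apply_ms in rows)
    cutoff = max(floor_ms, factor * med)
    return [osd for osd, _commit, apply_ms in rows if apply_ms >= cutoff]

# A few rows from the table above:
rows = [(72, 0, 1), (73, 74, 1217), (74, 0, 1), (75, 64, 1238), (84, 3, 56)]
print(outliers(rows))  # -> [73, 75]
```

On this sample the median apply latency is 56 ms, so the cutoff is 560 ms and osd.73 and osd.75 are flagged, matching the eyeball reading of the full table.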
Re: [ceph-users] ceph on peta scale
Thanks Greg. No, I am more interested in a large-scale RADOS system, not the filesystem. However, for geographically distributed datacentres, especially when the network fluctuates, how do we handle that? From what I have read, it seems Ceph needs a big network pipe. /Zee On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote: [quoted reply snipped -- identical to the message earlier in this thread] -- Regards Zeeshan Ali Shah System Administrator - PDC HPC PhD researcher (IT security) Kungliga Tekniska Hogskolan +46 8 790 9115 http://www.pdc.kth.se/members/zashah
[ceph-users] SSD Journal Best Practice
Hi everyone: I plan to use an SSD journal to improve performance. I have one 1.2 TB SSD disk per server. What is the best practice for the SSD journal? There are three ways to deploy the SSD journal:
1. All OSDs share the same SSD partition: ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd
2. Each OSD uses its own SSD partition: ceph-deploy osd create ceph-node:sdb:/dev/ssd1 ceph-node:sdc:/dev/ssd2
3. Each OSD uses a file for the journal, with the file on the SSD disk: ceph-deploy osd create ceph-node:sdb:/mnt/ssd/ssd1 ceph-node:sdc:/mnt/ssd/ssd2
Any suggestions? Thanks.
[ceph-users] How to get ceph-extras packages for centos7
Hi experts, could some of you guide me on how to get the ceph-extras packages for CentOS 7? I am trying to install Giant on CentOS 7 manually; however, the latest extras packages I can find in the repository are only for CentOS 6.4. BTW, is QEMU aware of Giant? Should I get a build dedicated to Giant? Thanks in advance. Thanks -- Ray Shi
Re: [ceph-users] CRUSH question - failing to rebalance after failure test
Hi, [redirecting back to list] Oh, it could be that... can you include the output from 'ceph osd tree'? That's a more concise view that shows up/down, weight, and in/out. Thanks! sage

root@cepharm17:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      0.52    root default
-21     0.16            chassis board0
-2      0.032                   host cepharm11
0       0.032                           osd.0   up      1
-3      0.032                   host cepharm12
1       0.032                           osd.1   up      1
-4      0.032                   host cepharm13
2       0.032                           osd.2   up      1
-5      0.032                   host cepharm14
3       0.032                           osd.3   up      1
-6      0.032                   host cepharm16
4       0.032                           osd.4   up      1
-22     0.18            chassis board1
-7      0.03                    host cepharm18
5       0.03                            osd.5   up      1
-8      0.03                    host cepharm19
6       0.03                            osd.6   up      1
-9      0.03                    host cepharm20
7       0.03                            osd.7   up      1
-10     0.03                    host cepharm21
8       0.03                            osd.8   up      1
-11     0.03                    host cepharm22
9       0.03                            osd.9   up      1
-12     0.03                    host cepharm23
10      0.03                            osd.10  up      1
-23     0.18            chassis board2
-13     0.03                    host cepharm25
11      0.03                            osd.11  up      1
-14     0.03                    host cepharm26
12      0.03                            osd.12  up      1
-15     0.03                    host cepharm27
13      0.03                            osd.13  up      1
-16     0.03                    host cepharm28
14      0.03                            osd.14  up      1
-17     0.03                    host cepharm29
15      0.03                            osd.15  up      1
-18     0.03                    host cepharm30
16      0.03                            osd.16  up      1

I am working on one of these boxes: http://www.ambedded.com.tw/pt_spec.php?P_ID=20141109001 So, each chassis is one 7-node board (with a shared 1gbe switch and shared electrical supply), and I figured each board is definitely a separate failure domain. Regards, --ck
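As an aside, treating each board as the failure domain means the CRUSH rule has to place replicas across chassis rather than hosts. A sketch of such a rule for a replicated pool (the rule name and ruleset number are made up for illustration):

```
rule replicate_across_boards {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type chassis
    step emit
}
```

The key line is `step chooseleaf firstn 0 type chassis`: it selects one leaf (OSD) under a distinct chassis bucket for each replica, so losing a whole 7-node board costs at most one copy of any object.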
Re: [ceph-users] NUMA zone_reclaim_mode
On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote: On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we should send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system… Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph? Cheers, Dan Thanks! sage
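The check Dan proposes really is tiny — a hedged sketch of the shape of such a warning (the function name and message text are invented, and the actual patch would read /proc/sys/vm/zone_reclaim_mode and route through the OSD's logging rather than return a string):

```python
# Warn when vm.zone_reclaim_mode is enabled, similar in spirit to
# MongoDB's startup check. `raw` is the sysctl file's contents.
def zone_reclaim_warning(raw):
    """Return a warning string if zone_reclaim_mode != 0, else None."""
    try:
        mode = int(raw.strip())
    except ValueError:
        return None  # unreadable: stay quiet rather than guess
    if mode == 0:
        return None
    return ("vm.zone_reclaim_mode = %d is known to hurt performance "
            "on NUMA machines; consider setting it to 0" % mode)

print(zone_reclaim_warning("1"))
print(zone_reclaim_warning("0"))  # -> None
```

Swallowing unparsable input keeps the warning from being "too annoying" on kernels or platforms where the sysctl does not exist.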
Re: [ceph-users] reset osd perf counters
perf reset on the admin socket. I'm not sure what version it went in to; you can check the release logs if it doesn't work on whatever you have installed. :) -Greg On Mon, Jan 12, 2015 at 2:26 PM, Shain Miley smi...@npr.org wrote: Is there a way to 'reset' the osd perf counters? The numbers for osd 73 though osd 83 look really high compared to the rest of the numbers I see here. I was wondering if I could clear the counters out, so that I have a fresh set of data to work with. [quoted `ceph osd perf` table snipped -- identical to the one in the original message]
Re: [ceph-users] cephfs modification time
Zheng, this looks like a kernel client issue to me, or else something funny is going on with the cap flushing and the timestamps (note how the reading client's ctime is set to an even second, while the mtime is ~.63 seconds later and matches what the writing client sees). Any ideas? -Greg On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote: Hi Gregory, $ uname -a Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux Kernel Client, using `mount -t ceph ...` core@coreos2 /var/run/systemd/system $ modinfo ceph filename: /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net alias: fs-ceph depends:libceph intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 core@coreos2 /var/run/systemd/system $ modinfo libceph filename: /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net depends:libcrc32c intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 ceph is installed on a ubuntu containers (same kernel): $ dpkg -l |grep ceph ii ceph 0.87-1trusty amd64distributed storage and file system ii ceph-common 0.87-1trusty amd64common utilities to mount and interact with a ceph storage cluster ii ceph-fs-common 0.87-1trusty amd64common utilities to mount and interact with a ceph file system ii ceph-fuse0.87-1trusty amd64FUSE-based client for the Ceph distributed file system ii ceph-mds 0.87-1trusty amd64metadata server for the ceph 
distributed file system ii libcephfs1 0.87-1trusty amd64 Ceph distributed file system client library ii python-ceph 0.87-1trusty amd64 Python libraries for the Ceph distributed filesystem Reproducing the error: at machine 1: core@coreos1 /var/lib/deis/store/logs $ > test.log core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log core@coreos1 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.637234229 + Birth: - at machine 2: core@coreos2 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.0 + Birth: - The change time is not updated, which makes some tail libraries not show new content until you force the change time to be updated, for example by running touch on the file. Some tools freeze and trigger other issues in the system. Tests, all in machine #2: FAILED - https://github.com/ActiveState/tail FAILED - /usr/bin/tail of a Google docker image running debian wheezy PASSED - /usr/bin/tail of a ubuntu 14.04 docker image PASSED - /usr/bin/tail of the coreos release 494.5.0 Tests in machine #1 (the same machine that is writing the file) all pass. On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote: What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc) Can you provide more details on exactly what the program is doing on which nodes?
-Greg On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote: The first 3 stat commands show blocks and size changing, but not the times; after a touch the times change and tail works. I saw some cephfs freezes related to it; it came back after touching the files. coreos2 logs # stat deis-router.log File: 'deis-router.log' Size: 148564 Blocks: 291 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511628780 Links: 1 Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root) Access: 2015-01-10 01:13:00.100582619
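For what it's worth, the pass/fail split across tail implementations above comes down to which stat fields each one compares between polls. A toy model of the two strategies (names are invented for illustration, not taken from any real tail library):

```python
from collections import namedtuple

Stat = namedtuple("Stat", "size mtime ctime")

def changed_by_mtime(old, new):
    # Tools keyed on size/mtime see the write.
    return new.size != old.size or new.mtime != old.mtime

def changed_by_ctime(old, new):
    # Tools keyed on ctime miss it when the reading client's
    # ctime is stale, as in the report above.
    return new.ctime != old.ctime

old = Stat(size=0, mtime=100.00, ctime=100.00)
new = Stat(size=2, mtime=166.63, ctime=100.00)  # ctime not propagated
print(changed_by_mtime(old, new))  # -> True
print(changed_by_ctime(old, new))  # -> False
```

Running `touch` on the file forces a fresh ctime, which is why it unblocks the ctime-keyed tools.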
[ceph-users] Ceph erasure-coded pool
All, I wish to experiment with erasure-coded pools in Ceph. I've got some questions:
1. Is FIREFLY a reasonable release to be using to try EC pools? When I look at various bits of development info, it appears that the work is complete in FIREFLY, but I thought I'd ask. :)
2. It looks, in FIREFLY, as if not all I/O operations can be performed on EC pools. I am trying to work with RBD clients, and I've run into some conflicting information... can RBDs run on EC pools directly, or is a caching tier required?
a. Assuming a cache tier is required, where might I read information on sizing the cache tier?
b. Looking through the issues, it appears there are some race conditions (e.g., #9285: http://tracker.ceph.com/issues/9285) for cache tiers in FIREFLY. Should I avoid cache tiers at this level? At what level, if any, are these addressed (I don't see commits in #9285, for example)?
3. When configuring EC pools, to specify the number of PGs, can I reasonably assume that I should use (K+M) instead of a replica count? So, for example, if I have 24 OSDs, and my EC profile has K=8 and M=4, then I should specify 200 (i.e., (24*100)/12) placement groups?
4. As I add OSDs, can I adjust the number of PGs?
Thanks in advance...
Don Doerner Quantum Corporation
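On question 3: substituting (K+M) for the replica count in the usual PG formula is indeed the standard reasoning; the common recommendation is then to round up to the next power of two. A sketch of the arithmetic (the power-of-two rounding is a rule of thumb, not a hard requirement):

```python
def ec_pg_count(num_osds, k, m, per_osd=100, round_pow2=True):
    """PG count for an EC pool: (osds * target PGs per OSD) / (k + m)."""
    raw = num_osds * per_osd // (k + m)
    if not round_pow2:
        return raw
    p = 1
    while p < raw:
        p *= 2
    return p

print(ec_pg_count(24, 8, 4, round_pow2=False))  # -> 200, as in the question
print(ec_pg_count(24, 8, 4))                    # -> 256, rounded up
```

So 200 is the number the formula gives for 24 OSDs with K=8, M=4; 256 would be the rounded figure.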
Re: [ceph-users] ceph on peta scale
however for geographic distributed datacentres specially when network flactuate how to handle that as i read it seems CEPH need big pipe of network Ceph isn't really suited for WAN-style distribution. Some users have high-enough and consistent-enough bandwidth (with low enough latency) to do it, but otherwise you probably want to use Ceph within the data centers and layer something else on top of it. Indeed. Ceph is not aware of WAN links. So reads and writes will be done remotely even if there is a copy locally. Bandwidth might not be much of an issue but latency certainly will be. Although bandwidth during a rebalance of data might also be problematic... Cheers, Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD Journal Best Practice
For the first choice: ceph-deploy osd create ceph-node:sdb:/dev/ssd ceph-node:sdc:/dev/ssd I find that ceph-deploy will create the partitions automatically, and each partition is 5 GB by default. So the first and second choices are almost the same. Compared to a journal file on a filesystem, I prefer a block device, to get better performance. From: lidc...@redhat.com Date: 2015-01-12 12:35 To: ceph-us...@ceph.com Subject: SSD Journal Best Practice [quoted message snipped -- see the original post in this thread]
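On sizing those partitions: the documented rule of thumb is `osd journal size = 2 * (expected throughput * filestore max sync interval)`, and 5 GB (`osd journal size = 5120`) is just the ceph-deploy default. A sketch of the arithmetic, assuming a node with 10 spinners sharing the one 1.2 TB SSD:

```python
def journal_size_mb(throughput_mb_s, sync_interval_s=5):
    # osd journal size = 2 * expected throughput * filestore max sync interval
    # (5 s is the default filestore max sync interval)
    return 2 * throughput_mb_s * sync_interval_s

# e.g. a ~150 MB/s spinner needs roughly:
print(journal_size_mb(150))  # -> 1500 (MB)

# Carving the 1.2 TB SSD evenly across 10 OSDs leaves far more per
# journal than needed -- the real concern is that all 10 OSDs share
# the SSD's failure domain and its write bandwidth.
per_osd_mb = 1_200_000 // 10
print(per_osd_mb)  # -> 120000
```

So with default tunables the 5 GB default is already comfortably above the rule-of-thumb size for HDD-backed OSDs.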
Re: [ceph-users] Replace corrupt journal
On Sun, 11 Jan 2015, Sahlstrom, Claes wrote: Hi, I have a problem starting a couple of OSDs because of the journal being corrupt. Is there any way to replace the journal and keep the rest of the OSD intact? It is risky at best... I would not recommend it! The safe route is to wipe the OSD and let the cluster repair. -1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt 0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, bool*)' thread 7fb32df86900 time 2015-01-11 16:02:54.475276 os/FileJournal.cc: 1693: FAILED assert(0) Do you mind making a note that you saw this on this ticket: http://tracker.ceph.com/issues/6003 We see it periodically in QA but have never been able to track it down. It could also be caused by a hardware issue, so any information about whether the journal device appears damaged would be helpful. Thanks! sage
Re: [ceph-users] Replace corrupt journal
Thanks for the reply, I have had some more time to mess around with this now. I understand that the best thing is to allow it to rebuild the entire OSD, but since I am only using one replica and 2 of 3 machines had problems, I ended up in a bad situation. With OSDs down on 2 machines and one replica, I think I would lose data for certain if I rebuilt them from scratch. Luckily in my case there was no new data being written to the cluster at that time; I only use it as a NAS in my home lab. It did work out fine for me this time, but I guess anyone reading this should know it is not a recommended way to do things. I got confused because I was reusing a logical volume as the journal and I didn't wipe it properly before I used --mkjournal; after wiping it properly and then using --mkjournal again, the problem seems to be solved. My only outstanding issue now is one pg that remains inconsistent even after trying to do a repair; besides that, everything seems to be fine. I haven't dug too much into that yet; with only one replica I guess it is tricky to tell which of the copies is the broken one. I will add a note to that ticket; it happened when the power to the server was lost while replicating, and I think that is what made two journals corrupt. Cheers, Claes -----Original Message----- From: Sage Weil [mailto:s...@newdream.net] Sent: den 12 januari 2015 15:46 To: Sahlstrom, Claes Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] Replace corrupt journal On Sun, 11 Jan 2015, Sahlstrom, Claes wrote: Hi, I have a problem starting a couple of OSDs because of the journal being corrupt. Is there any way to replace the journal and keeping the rest of the OSD intact. It is risky at best... I would not recommend it! The safe route is to wipe the OSD and let the cluster repair.
-1 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt 0 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist, uint64_t, bool*)' thread 7fb32df86900 time 2015-01-11 16:02:54.475276 os/FileJournal.cc: 1693: FAILED assert(0) Do you mind making a note that you saw this on this ticket: http://tracker.ceph.com/issues/6003 We see it periodically in QA but have never been able to track it down. It could also be caused by a hardware issue, so any information about whether the journal device appears damanged would be helpful. Thanks! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Caching
I have a couple of questions about caching: I have 5 VM hosts serving 20 VMs. I have 1 Ceph pool where the VM disks of those 20 VMs reside as RBD images.
1) Can I use multiple caching tiers on the same data pool? I would like to use a local SSD OSD on each VM host that can serve as an application-accelerator local cache for the VM disks. I can imagine data corruption if other VM hosts write to the same Ceph data pool while not using the same caching tier. I imagine no data corruption if I know no other VM host will access that Ceph object (VM disk / RBD image). I would need to flush the cache of that VM host when I shut down the VM on it, before I can start the VM on a different VM host. Or is Ceph perhaps smart enough that it would notify the above caching tier to evict a cached object when there is a change on that object made outside that caching tier?
2) Is the RBD cache useless for hosting Oracle databases? If Oracle does an O_SYNC and the RBD cache flushes on O_SYNC, then there would be nothing cached. Correct?
3) Would a caching tier be smart enough to flush dirty/modified objects on idle I/O? (When client I/O is not busy, Ceph would use that time to sync to the backend.) I know it will flush at a certain capacity (50%) or at a certain age (600 sec), but can it also flush based on a busy/idle percentage, or auto-magically/intelligently?
Thanks, Samuel Terburg Panther-IT BV www.panther-it.nl
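On question 3: the two thresholds mentioned correspond to the pool settings `cache_target_dirty_ratio` and `cache_min_flush_age`. A deliberately simplified model of the flush decision, framed the way the question frames it (the real tiering agent is more involved — for instance, the age setting acts as a minimum age gate for flushable objects rather than a standalone trigger):

```python
def should_flush(dirty_ratio, dirty_age_s,
                 target_dirty_ratio=0.5, min_flush_age_s=600):
    """Flush when the cache is dirty enough OR the object is old enough."""
    return dirty_ratio >= target_dirty_ratio or dirty_age_s >= min_flush_age_s

print(should_flush(0.6, 30))   # -> True  (capacity threshold hit)
print(should_flush(0.1, 700))  # -> True  (age threshold hit)
print(should_flush(0.1, 30))   # -> False (agent has nothing to do)
```

There is no busy/idle input in this model, which mirrors the situation being asked about: the agent is driven by capacity and age, not by client I/O load.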