Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-09 Thread Danny Al-Gaaf
Am 10.01.2013 05:32, schrieb Gary Lowell:
> I have this patch, and the ones from Friday in the wip-rpm-update branch.  
> Everything looks good except that we have the following new warning from 
> configure:
> 
> ….
> checking for kaffe... no
> checking for java... java
> checking for uudecode... no
> WARNING: configure: I have to compile Test.class from scratch
> checking for gcj... no
> checking for guavac... no
> checking for jikes... no
> ….
> 
> This may have to do with something in our build environment.

I assume you don't have uudecode installed. It should be part of sharutils
(http://www.gnu.org/software/sharutils/).
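
For anyone hitting the same warning, a quick way to check is below (sketch only; the package name is an assumption — sharutils provides uudecode on most distros). configure only falls back to compiling Test.class when uudecode is missing:

```shell
#!/bin/sh
# Report whether uudecode is on PATH; if it is missing, configure has to
# compile Test.class from scratch, which produces the warning above.
if command -v uudecode >/dev/null 2>&1; then
    echo "uudecode: found at $(command -v uudecode)"
else
    echo "uudecode: missing -- install the sharutils package"
fi
```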

Regards,

Danny
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recoverying from 95% full osd

2013-01-09 Thread Roman Hlynovskiy
Hello again!

I left the system in a working state overnight and found it in a weird
state this morning:

chef@ceph-node02:/var/log/ceph$ ceph -s
   health HEALTH_OK
   monmap e4: 3 mons at
{a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0},
election epoch 254, quorum 0,1,2 a,b,c
   osdmap e348: 3 osds: 3 up, 3 in
pgmap v114606: 384 pgs: 384 active+clean; 161 GB data, 326 GB
used, 429 GB / 755 GB avail
   mdsmap e4623: 1/1/1 up {0=b=up:active}, 1 up:standby

So it looks OK at first glance; however, I am not able
to mount ceph from any of the nodes:
be01:~# mount /var/www/jroger.org/data
mount: 192.168.7.11:/: can't read superblock

On the nodes which had ceph mounted yesterday I am able to browse
the filesystem, but any data read causes the client to
hang.

I made a trace on the active mds with debug ms/mds = 20
(http://wh.of.kz/ceph_logs.tar.gz)
Could you please help identify what's going on?

2013/1/9 Roman Hlynovskiy :
>>> How many pgs do you have? ('ceph osd dump | grep ^pool').
>>
>> I believe this is it. 384 PGs, but three pools of which only one (or maybe a 
>> second one, sort of) is in use. Automatically setting the right PG counts is 
>> coming some day, but until then being able to set up pools of the right size 
>> is a big gotcha. :(
>> Depending on how mutable the data is, recreate with larger PG counts on the 
>> pools in use. Otherwise we can do something more detailed.
>> -Greg
>
> hm... what would be the recommended PG count per pool?
>
> chef@cephgw:~$ ceph osd lspools
> 0 data,1 metadata,2 rbd,
> chef@cephgw:~$ ceph osd pool get data pg_num
> PG_NUM: 128
> chef@cephgw:~$ ceph osd pool get metadata pg_num
> PG_NUM: 128
> chef@cephgw:~$ ceph osd pool get rbd pg_num
> PG_NUM: 128
>
> according to the 
> http://ceph.com/docs/master/rados/operations/placement-groups/
>
> Total PGs = (OSDs * 100) / Replicas
>
> I have 3 OSDs and 2 replicas for each object, which gives recommended PG = 150
>
> Will it make much difference to set 150 instead of 128, or should I
> base it on different values?
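
The quoted rule of thumb works out as below; rounding up to the next power of two is a common extra recommendation (the function names here are illustrative, not a Ceph API):

```python
def recommended_pgs(osds, replicas, per_osd=100):
    """Rule of thumb from the placement-groups doc: (OSDs * 100) / replicas."""
    return osds * per_osd // replicas

def next_power_of_two(n):
    """Round up to a power of two, a commonly suggested refinement."""
    p = 1
    while p < n:
        p *= 2
    return p

raw = recommended_pgs(3, 2)         # 3 OSDs, 2 replicas -> 150
print(raw, next_power_of_two(raw))  # 150 256
```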
>
> btw, just one more off-topic question:
>
> chef@ceph-node03:~$ ceph pg dump | egrep -v '^(0\.|1\.|2\.)' | column -t
> dumped all in format plain
> version 113906
> last_osdmap_epoch 323
> last_pg_scan 1
> full_ratio 0.95
> nearfull_ratio 0.85
> pg_stat  objects  mip  degr  unf  bytes         log       disklog
> pool 0   74748    0    0     0    286157692336  17668034  17668034
> pool 1   618      0    0     0    131846062     6414518   6414518
> pool 2   0        0    0     0    0             0         0
> sum      75366    0    0     0    286289538398  24082552  24082552
> (remaining columns: state, state_stamp, v, reported, up, acting,
> last_scrub, scrub_stamp, last_deep_scrub, deep_scrub_stamp)
> osdstat  kbused     kbavail    kb         hb in  hb out
> 0        157999220  106227596  264226816  [1,2]  []
> 1        185604948  78621868   264226816  [0,2]  []
> 2        219475396  44751420   264226816  [0,1]  []
> sum      563079564  229600884  792680448
>
> pool 0 (data) is used for data storage
> pool 1 (metadata) is used for metadata storage
>
> What is pool 2 (rbd) for? It looks like it's completely empty.
>
>
>>
>>>
>>> You might also adjust the crush tunables, see
>>>
>>> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
>>>
>>> sage
>>>
>
> Thanks for the link, Sage. I set the tunable values according to the doc.
> Btw, the online document is missing the magical param for the crushmap which
> allows those scary_tunables )
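
For the record, the tunables round-trip being described usually looks like the sketch below (commands from that era's tools — verify against your version's docs; --enable-unsafe-tunables is presumably the "magical param" in question):

```shell
# Dump, decompile, edit, recompile, and inject the CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, e.g. add at the top:
#   tunable choose_local_tries 0
#   tunable choose_local_fallback_tries 0
#   tunable choose_total_tries 50
crushtool -c crushmap.txt -o crushmap.new --enable-unsafe-tunables
ceph osd setcrushmap -i crushmap.new
```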
>
>
>
> --
> ...WBR, Roman Hlynovskiy



-- 
...WBR, Roman Hlynovskiy


Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-09 Thread Gary Lowell
I have this patch, and the ones from Friday in the wip-rpm-update branch.  
Everything looks good except that we have the following new warning from 
configure:

….
checking for kaffe... no
checking for java... java
checking for uudecode... no
WARNING: configure: I have to compile Test.class from scratch
checking for gcj... no
checking for guavac... no
checking for jikes... no
….

This may have to do with something in our build environment.

Cheers,
Gary

On Jan 9, 2013, at 1:54 PM, Noah Watkins wrote:

> I haven't tested this yet, but I like it. I think several of these
> macros can be used to simplify a bit more of the Java config bit. I
> also just saw the ax_jni_include_dir macro in the autoconf archive and
> it looks like that can help clean-up too.
> 
> On Wed, Jan 9, 2013 at 1:35 PM, Danny Al-Gaaf  wrote:
>> The attached patch depends on the set of 6 patches I sent some days ago.
>> See: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/11793
>> 
>> Danny Al-Gaaf (1):
>>  configure.ac: check for org.junit.rules.ExternalResource
>> 
>> autogen.sh|   2 +-
>> configure.ac  |  29 ++---
>> m4/ac_check_class.m4  | 108 
>> ++
>> m4/ac_check_classpath.m4  |  24 +++
>> m4/ac_check_rqrd_class.m4 |  26 +++
>> m4/ac_java_options.m4 |  33 ++
>> m4/ac_prog_jar.m4 |  39 +
>> m4/ac_prog_java.m4|  83 +++
>> m4/ac_prog_java_works.m4  |  98 +
>> m4/ac_prog_javac.m4   |  45 +++
>> m4/ac_prog_javac_works.m4 |  36 
>> m4/ac_prog_javah.m4   |  28 
>> m4/ac_try_compile_java.m4 |  40 +
>> m4/ac_try_run_javac.m4|  41 ++
>> 14 files changed, 615 insertions(+), 17 deletions(-)
>> create mode 100644 m4/ac_check_class.m4
>> create mode 100644 m4/ac_check_classpath.m4
>> create mode 100644 m4/ac_check_rqrd_class.m4
>> create mode 100644 m4/ac_java_options.m4
>> create mode 100644 m4/ac_prog_jar.m4
>> create mode 100644 m4/ac_prog_java.m4
>> create mode 100644 m4/ac_prog_java_works.m4
>> create mode 100644 m4/ac_prog_javac.m4
>> create mode 100644 m4/ac_prog_javac_works.m4
>> create mode 100644 m4/ac_prog_javah.m4
>> create mode 100644 m4/ac_try_compile_java.m4
>> create mode 100644 m4/ac_try_run_javac.m4
>> 
>> --
>> 1.8.1
>> 



Re: OSD crash, ceph version 0.56.1

2013-01-09 Thread Ian Pye
On Wed, Jan 9, 2013 at 4:38 PM, Sage Weil  wrote:
> On Wed, 9 Jan 2013, Ian Pye wrote:
>> Hi,
>>
>> Every time I try to bring up an OSD, it crashes and I get the
>> following: "error (121) Remote I/O error not handled on operation 20"
>
> This error code (EREMOTEIO) is not used by Ceph.  What fs are you using?
> Which kernel version?  Anything else unusual happen with your hardware
> recently that might have wreaked havoc on your underlying fs?

3.7.1 kernel with XFS. It's a demo box from a vendor, so it should be brand new.

I'm going to say it's a disk error, given the following:

mkfs.xfs: read failed: Input/output error

Interestingly, running an OSD on btrfs worked fine on the same disk.

Thanks for the help,

Ian

>
> sage
>
>
>
>> The cluster is new and only has a little bit of data on it. Any ideas
>> what is going on? Does Remote I/O mean a network error? Full log
>> below:
>>
>>-9> 2013-01-10 00:00:20.182237 7f2ddde8f910  0
>> filestore(/mnt/dist_j/ceph)  error (121) Remote I/O error not handled
>> on operation 20 (12.0.0, or op 0, counting from 0)
>> -8> 2013-01-10 00:00:20.182275 7f2ddde8f910  0
>> filestore(/mnt/dist_j/ceph) unexpected error code
>> -7> 2013-01-10 00:00:20.182285 7f2ddde8f910  0
>> filestore(/mnt/dist_j/ceph)  transaction dump:
>> { "ops": [
>> { "op_num": 0,
>>   "op_name": "mkcoll",
>>   "collection": "0.2c0_head"},
>> { "op_num": 1,
>>   "op_name": "collection_setattr",
>>   "collection": "0.2c0_head",
>>   "name": "info",
>>   "length": 5},
>> { "op_num": 2,
>>   "op_name": "truncate",
>>   "collection": "meta",
>>   "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>>   "offset": 0},
>> { "op_num": 3,
>>   "op_name": "write",
>>   "collection": "meta",
>>   "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>>   "length": 531,
>>   "offset": 0,
>>   "bufferlist length": 531},
>> { "op_num": 4,
>>   "op_name": "remove",
>>   "collection": "meta",
>>   "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
>> { "op_num": 5,
>>   "op_name": "write",
>>   "collection": "meta",
>>   "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1",
>>   "length": 0,
>>   "offset": 0,
>>   "bufferlist length": 0},
>> { "op_num": 6,
>>   "op_name": "collection_setattr",
>>   "collection": "0.2c0_head",
>>   "name": "ondisklog",
>>   "length": 34},
>> { "op_num": 7,
>>   "op_name": "nop"}]}
>> -6> 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
>> _send_mon_message to mon.a at 108.162.209.120:6789/0
>> -5> 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
>> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
>> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
>> v22) v1 -- ?+0 0x5b15600 con 0x34629a0
>> -4> 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
>> _send_mon_message to mon.a at 108.162.209.120:6789/0
>> -3> 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
>> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
>> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
>> v22) v1 -- ?+0 0x5f75600 con 0x34629a0
>> -2> 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
>> _send_mon_message to mon.a at 108.162.209.120:6789/0
>> -1> 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
>> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
>> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
>> v22) v1 -- ?+0 0x5b15400 con 0x34629a0
>>  0> 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
>> function 'unsigned int
>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)'
>> thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
>> os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")
>>
>>  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>>  1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
>> long, int)+0x90a) [0x73e14a]
>>  2: (FileStore::do_transactions(std::list> std::allocator >&, unsigned long)+0x4c)
>> [0x7455dc]
>>  3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
>>  4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
>>  5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
>>  6: /lib/libpthread.so.0 [0x7f2de6d087aa]
>>  7: (clone()+0x6d) [0x7f2de518159d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> --- logging levels ---
>>0/ 5 none
>>0/ 1 lockdep
>>0/ 1 context
>>1/ 1 crush
>>1/ 5 mds
>>

Re: ceph caps (Ganesha + Ceph pnfs)

2013-01-09 Thread Sage Weil
On Tue, 8 Jan 2013, Matt W. Benjamin wrote:
> Hi Sage,
> 
> - "Sage Weil"  wrote:
> > Your previous question made it sound like the DS was interacting with
> > 
> > libcephfs and dealing with (some) MDS capabilities.  Is that right?
> > 
> > I wonder if a much simpler approach would be to make a different fh 
> > format or type, and just cram the inode and ceph object/block number 
> > in there.  Then the DS can just go direct to rados and avoid 
> > interacting with the fs at all.  There are some additional semantics 
> > surrounding the truncate metadata, but if we're lucky that can fit 
> > inside the fh, and the DS servers could really just act like object 
> > targets--no libcephfs or MDS interaction at all.
> 
> The current architecture gets the inode and block information to the DS 
> reliably already without change to the Ceph fh--decoding steering 
> information happens at the MDS, rather than the DS.  It is important to 
> us to ensure that the total steering information be "finite and 
> manageable," though, since we need it to travel with the pNFS layout to 
> the NFS client.

As a practical matter, that means your DS is actually doing an 
open/lookup on the fh?  My general concern is that that'll kill 
performance...

> It is definitely the goal for the DS to go direct to rados.  I think the 
> outstanding issue may be limited to getting the MDS view of metadata 
> up-to-date after an extending or truncating i/o completes (at least in 
> the immediate term).

...but now I see the issue with committing the layout on the DS vs the 
MDS.

> You may well be thinking, "sheesh, the client is doing out-of-band i/o, 
> why doesn't it send the LAYOUTCOMMIT operation to the MDS to update the 
> metadata."  The unsatisfactory answer is that currently (due to our use 
> of the "files" layout type) clients can insist that the DS do the 
> commit.  The Linux kernel client does so for writes below a size 
> threshold.
> 
> For the longer term, an option is shaping up that would allow us to use 
> the objects layout (RFC 5664), which always commits layouts. 

Meaning, the client always commits the layout via the MDS after writing 
data to the objects?

> This 
> discussion seems to be adding to the argument in support of switching, 
> frankly.  My intuition is that it's preferable to let the DS jump layers 
> to commit, though, even if we want to elide such commits in future (not 
> just for expediency, but because the flexibility to do it seems like a 
> win for the Ceph architecture).

Maybe... but if the DSs don't have open sessions with the MDS, they'd have 
to open them.  Even if they did, they'd need to get caps on the inode 
before they could flush new size/mtime metadata.  Unless we add a new 
operation that behaves similarly to how we normally do cap flushes: 
make the size at least X and the mtime at least Y.

For small files, that seems like a win.  For large files, you don't want 
to send a request like that to the MDS for every object/block if you can 
do it once from the pnfs client -> mds.

Am I understanding correctly that doing a single commit from the client 
(with the final file size) is what the object layout allows?

sage

> 
> > 
> > Either way, to your first (original question), yes, we should expose a 
> > way via libcephfs to take a reference on the capability that isn't 
> > released until the layout is committed.  That should be pretty 
> > straightforward to do, I think.
> 
> Excellent.
> 
> > 
> > Hopefully my understanding is getting closer!
> > 
> > :) sage
> > 
> 
> Indeed, thanks
> 
> -- 
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
> 
> http://linuxbox.com
> 
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
> 
> 


Re: OSD crash, ceph version 0.56.1

2013-01-09 Thread Sage Weil
On Wed, 9 Jan 2013, Ian Pye wrote:
> Hi,
> 
> Every time I try to bring up an OSD, it crashes and I get the
> following: "error (121) Remote I/O error not handled on operation 20"

This error code (EREMOTEIO) is not used by Ceph.  What fs are you using?  
Which kernel version?  Anything else unusual happen with your hardware 
recently that might have wreaked havoc on your underlying fs?
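
For reference, errno 121 on Linux is EREMOTEIO ("Remote I/O error"), so FileStore is just surfacing what the kernel/underlying filesystem returned for the operation:

```python
import errno
import os

# EREMOTEIO is a kernel errno, not something Ceph generates itself;
# on Linux it maps to 121 / "Remote I/O error".
print(errno.EREMOTEIO)               # 121 on Linux
print(os.strerror(errno.EREMOTEIO))  # Remote I/O error
```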

sage



> The cluster is new and only has a little bit of data on it. Any ideas
> what is going on? Does Remote I/O mean a network error? Full log
> below:
> 
>-9> 2013-01-10 00:00:20.182237 7f2ddde8f910  0
> filestore(/mnt/dist_j/ceph)  error (121) Remote I/O error not handled
> on operation 20 (12.0.0, or op 0, counting from 0)
> -8> 2013-01-10 00:00:20.182275 7f2ddde8f910  0
> filestore(/mnt/dist_j/ceph) unexpected error code
> -7> 2013-01-10 00:00:20.182285 7f2ddde8f910  0
> filestore(/mnt/dist_j/ceph)  transaction dump:
> { "ops": [
> { "op_num": 0,
>   "op_name": "mkcoll",
>   "collection": "0.2c0_head"},
> { "op_num": 1,
>   "op_name": "collection_setattr",
>   "collection": "0.2c0_head",
>   "name": "info",
>   "length": 5},
> { "op_num": 2,
>   "op_name": "truncate",
>   "collection": "meta",
>   "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>   "offset": 0},
> { "op_num": 3,
>   "op_name": "write",
>   "collection": "meta",
>   "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>   "length": 531,
>   "offset": 0,
>   "bufferlist length": 531},
> { "op_num": 4,
>   "op_name": "remove",
>   "collection": "meta",
>   "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
> { "op_num": 5,
>   "op_name": "write",
>   "collection": "meta",
>   "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1",
>   "length": 0,
>   "offset": 0,
>   "bufferlist length": 0},
> { "op_num": 6,
>   "op_name": "collection_setattr",
>   "collection": "0.2c0_head",
>   "name": "ondisklog",
>   "length": 34},
> { "op_num": 7,
>   "op_name": "nop"}]}
> -6> 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
> _send_mon_message to mon.a at 108.162.209.120:6789/0
> -5> 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
> v22) v1 -- ?+0 0x5b15600 con 0x34629a0
> -4> 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
> _send_mon_message to mon.a at 108.162.209.120:6789/0
> -3> 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
> v22) v1 -- ?+0 0x5f75600 con 0x34629a0
> -2> 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
> _send_mon_message to mon.a at 108.162.209.120:6789/0
> -1> 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
> v22) v1 -- ?+0 0x5b15400 con 0x34629a0
>  0> 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
> function 'unsigned int
> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)'
> thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
> os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")
> 
>  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>  1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
> long, int)+0x90a) [0x73e14a]
>  2: (FileStore::do_transactions(std::list std::allocator >&, unsigned long)+0x4c)
> [0x7455dc]
>  3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
>  4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
>  5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
>  6: /lib/libpthread.so.0 [0x7f2de6d087aa]
>  7: (clone()+0x6d) [0x7f2de518159d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>0/ 5 osd
>0/ 5 optracker
>0/ 5 objclass
>1/ 3 filestore
>1/ 3 journal
>0/ 5 ms
>1/ 5 mon
>0/10 monc
>0/ 5 paxos

OSD crash, ceph version 0.56.1

2013-01-09 Thread Ian Pye
Hi,

Every time I try to bring up an OSD, it crashes and I get the
following: "error (121) Remote I/O error not handled on operation 20"
The cluster is new and only has a little bit of data on it. Any ideas
what is going on? Does Remote I/O mean a network error? Full log
below:

   -9> 2013-01-10 00:00:20.182237 7f2ddde8f910  0
filestore(/mnt/dist_j/ceph)  error (121) Remote I/O error not handled
on operation 20 (12.0.0, or op 0, counting from 0)
-8> 2013-01-10 00:00:20.182275 7f2ddde8f910  0
filestore(/mnt/dist_j/ceph) unexpected error code
-7> 2013-01-10 00:00:20.182285 7f2ddde8f910  0
filestore(/mnt/dist_j/ceph)  transaction dump:
{ "ops": [
{ "op_num": 0,
  "op_name": "mkcoll",
  "collection": "0.2c0_head"},
{ "op_num": 1,
  "op_name": "collection_setattr",
  "collection": "0.2c0_head",
  "name": "info",
  "length": 5},
{ "op_num": 2,
  "op_name": "truncate",
  "collection": "meta",
  "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
  "offset": 0},
{ "op_num": 3,
  "op_name": "write",
  "collection": "meta",
  "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
  "length": 531,
  "offset": 0,
  "bufferlist length": 531},
{ "op_num": 4,
  "op_name": "remove",
  "collection": "meta",
  "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
{ "op_num": 5,
  "op_name": "write",
  "collection": "meta",
  "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1",
  "length": 0,
  "offset": 0,
  "bufferlist length": 0},
{ "op_num": 6,
  "op_name": "collection_setattr",
  "collection": "0.2c0_head",
  "name": "ondisklog",
  "length": 34},
{ "op_num": 7,
  "op_name": "nop"}]}
-6> 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
_send_mon_message to mon.a at 108.162.209.120:6789/0
-5> 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
{0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
v22) v1 -- ?+0 0x5b15600 con 0x34629a0
-4> 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
_send_mon_message to mon.a at 108.162.209.120:6789/0
-3> 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
{0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
v22) v1 -- ?+0 0x5f75600 con 0x34629a0
-2> 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
_send_mon_message to mon.a at 108.162.209.120:6789/0
-1> 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
{0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
v22) v1 -- ?+0 0x5b15400 con 0x34629a0
 0> 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)'
thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
long, int)+0x90a) [0x73e14a]
 2: (FileStore::do_transactions(std::list >&, unsigned long)+0x4c)
[0x7455dc]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
 5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
 6: /lib/libpthread.so.0 [0x7f2de6d087aa]
 7: (clone()+0x6d) [0x7f2de518159d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent10
  max_new 1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
2013-01-10 00:00:20.227763 7f2ddde8f910 -1 *** Caught signal (Aborted) **
 in thread 7f2ddde8f910

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 

Re: OSD memory leaks?

2013-01-09 Thread Dave Spano
Thank you. I appreciate it! 

Dave Spano 
Optogenics 
Systems Administrator 



- Original Message - 

From: "Sébastien Han"  
To: "Dave Spano"  
Cc: "ceph-devel" , "Samuel Just" 
 
Sent: Wednesday, January 9, 2013 5:12:12 PM 
Subject: Re: OSD memory leaks? 

Dave, I'll share my little script with you for now, if you want it: 

#!/bin/bash 

for i in $(ps aux | grep [c]eph-osd | awk '{print $4}') 
do 
MEM_INTEGER=$(echo $i | cut -d '.' -f1) 
OSD=$(ps aux | grep [c]eph-osd | grep "$i " | awk '{print $13}') 
if [[ $MEM_INTEGER -ge 25 ]];then 
service ceph restart osd.$OSD >> /dev/null 
if [ $? -eq 0 ]; then 
logger -t ceph-memory-usage "The OSD number $OSD has been restarted 
since it was using $i % of the memory" 
else 
logger -t ceph-memory-usage "ERROR while 
restarting the OSD daemon" 
fi 
else 
logger -t ceph-memory-usage "The OSD number $OSD is 
only using $i % of the memory, doing nothing" 
fi 
logger -t ceph-memory-usage "Waiting 60 seconds before testing the next OSD..." 
sleep 60 
done 

logger -t ceph-memory-usage "Ceph state after memory check operation 
is: $(ceph health)" 

Cron runs it at 10-minute intervals every day on each storage node ;-). 

Waiting for some Inktank guys now :-). 
-- 
Regards, 
Sébastien Han. 


On Wed, Jan 9, 2013 at 10:42 PM, Dave Spano  wrote: 
> That's very good to know. I'll be restarting ceph-osd right now! Thanks for 
> the heads up! 
> 
> Dave Spano 
> Optogenics 
> Systems Administrator 
> 
> 
> 
> - Original Message - 
> 
> From: "Sébastien Han"  
> To: "Dave Spano"  
> Cc: "ceph-devel" , "Samuel Just" 
>  
> Sent: Wednesday, January 9, 2013 11:35:13 AM 
> Subject: Re: OSD memory leaks? 
> 
> If you wait too long, the system will trigger OOM killer :D, I already 
> experienced that unfortunately... 
> 
> Sam? 
> 
> On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano  wrote: 
>> OOM killer 
> 
> 
> 
> -- 
> Regards, 
> Sébastien Han.


Re: OSD memory leaks?

2013-01-09 Thread Sébastien Han
Dave, I'll share my little script with you for now, if you want it:

#!/bin/bash

for i in $(ps aux | grep [c]eph-osd | awk '{print $4}')
do
MEM_INTEGER=$(echo $i | cut -d '.' -f1)
OSD=$(ps aux | grep [c]eph-osd | grep "$i " | awk '{print $13}')
if [[ $MEM_INTEGER -ge 25 ]];then
service ceph restart osd.$OSD >> /dev/null
if [ $? -eq 0 ]; then
logger -t ceph-memory-usage "The OSD number $OSD has been restarted
since it was using $i % of the memory"
else
logger -t ceph-memory-usage "ERROR while
restarting the OSD daemon"
fi
else
logger -t ceph-memory-usage "The OSD number $OSD is
only using $i % of the memory, doing nothing"
fi
logger -t ceph-memory-usage "Waiting 60 seconds before testing the next OSD..."
sleep 60
done

logger -t ceph-memory-usage "Ceph state after memory check operation
is: $(ceph health)"

Cron runs it at 10-minute intervals every day on each storage node ;-).

Waiting for some Inktank guys now :-).
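
As an aside, the script's selection logic (fields $4 and $13 of `ps aux`) can be expressed as a small, testable helper; this is purely an illustrative sketch, not part of Ceph, and it assumes the osd id lands in field 13 just as the awk does:

```python
def osds_over_threshold(ps_lines, threshold=25.0):
    """Given `ps aux` output lines, return the ids of ceph-osd processes
    whose %MEM (field 4) meets the threshold. The osd id is taken from
    field 13 (awk's $13), matching the shell script's assumption."""
    over = []
    for line in ps_lines:
        if 'ceph-osd' not in line:
            continue
        fields = line.split()
        if len(fields) < 13:
            continue
        mem = float(fields[3])       # %MEM column
        if mem >= threshold:
            over.append(fields[12])  # awk's $13: the osd id
    return over

sample = [
    "root 101 1.2 27.3 0 0 ? Ssl Jan09 1:00 /usr/bin/ceph-osd -i 2 -c /etc/ceph/ceph.conf",
    "root 102 0.8 12.0 0 0 ? Ssl Jan09 0:40 /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf",
]
print(osds_over_threshold(sample))  # ['2']
```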
--
Regards,
Sébastien Han.


On Wed, Jan 9, 2013 at 10:42 PM, Dave Spano  wrote:
> That's very good to know. I'll be restarting ceph-osd right now! Thanks for 
> the heads up!
>
> Dave Spano
> Optogenics
> Systems Administrator
>
>
>
> - Original Message -
>
> From: "Sébastien Han" 
> To: "Dave Spano" 
> Cc: "ceph-devel" , "Samuel Just" 
> 
> Sent: Wednesday, January 9, 2013 11:35:13 AM
> Subject: Re: OSD memory leaks?
>
> If you wait too long, the system will trigger OOM killer :D, I already
> experienced that unfortunately...
>
> Sam?
>
> On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano  wrote:
>> OOM killer
>
>
>
> --
> Regards,
> Sébastien Han.


Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-09 Thread Noah Watkins
I haven't tested this yet, but I like it. I think several of these
macros can be used to simplify a bit more of the Java config bit. I
also just saw the ax_jni_include_dir macro in the autoconf archive and
it looks like that can help clean-up too.

On Wed, Jan 9, 2013 at 1:35 PM, Danny Al-Gaaf  wrote:
> The attached patch depends on the set of 6 patches I sent some days ago.
> See: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/11793
>
> Danny Al-Gaaf (1):
>   configure.ac: check for org.junit.rules.ExternalResource
>
>  autogen.sh|   2 +-
>  configure.ac  |  29 ++---
>  m4/ac_check_class.m4  | 108 
> ++
>  m4/ac_check_classpath.m4  |  24 +++
>  m4/ac_check_rqrd_class.m4 |  26 +++
>  m4/ac_java_options.m4 |  33 ++
>  m4/ac_prog_jar.m4 |  39 +
>  m4/ac_prog_java.m4|  83 +++
>  m4/ac_prog_java_works.m4  |  98 +
>  m4/ac_prog_javac.m4   |  45 +++
>  m4/ac_prog_javac_works.m4 |  36 
>  m4/ac_prog_javah.m4   |  28 
>  m4/ac_try_compile_java.m4 |  40 +
>  m4/ac_try_run_javac.m4|  41 ++
>  14 files changed, 615 insertions(+), 17 deletions(-)
>  create mode 100644 m4/ac_check_class.m4
>  create mode 100644 m4/ac_check_classpath.m4
>  create mode 100644 m4/ac_check_rqrd_class.m4
>  create mode 100644 m4/ac_java_options.m4
>  create mode 100644 m4/ac_prog_jar.m4
>  create mode 100644 m4/ac_prog_java.m4
>  create mode 100644 m4/ac_prog_java_works.m4
>  create mode 100644 m4/ac_prog_javac.m4
>  create mode 100644 m4/ac_prog_javac_works.m4
>  create mode 100644 m4/ac_prog_javah.m4
>  create mode 100644 m4/ac_try_compile_java.m4
>  create mode 100644 m4/ac_try_run_javac.m4
>
> --
> 1.8.1
>


Re: OSD memory leaks?

2013-01-09 Thread Dave Spano
That's very good to know. I'll be restarting ceph-osd right now! Thanks for the 
heads up! 

Dave Spano 
Optogenics 
Systems Administrator 



- Original Message - 

From: "Sébastien Han"  
To: "Dave Spano"  
Cc: "ceph-devel" , "Samuel Just" 
 
Sent: Wednesday, January 9, 2013 11:35:13 AM 
Subject: Re: OSD memory leaks? 

If you wait too long, the system will trigger OOM killer :D, I already 
experienced that unfortunately... 

Sam? 

On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano  wrote: 
> OOM killer 



-- 
Regards, 
Sébastien Han.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-09 Thread Danny Al-Gaaf
Check for org.junit.rules.ExternalResource if built with
--enable-cephfs-java and --with-debug. Checking for junit4
isn't enough, since junit4 did not include this class before 4.7.

Added some m4 files to provide Java-related macros. Changed
autogen.sh to work with local m4 files/macros.

Signed-off-by: Danny Al-Gaaf 
---
 autogen.sh|   2 +-
 configure.ac  |  29 ++---
 m4/ac_check_class.m4  | 108 ++
 m4/ac_check_classpath.m4  |  24 +++
 m4/ac_check_rqrd_class.m4 |  26 +++
 m4/ac_java_options.m4 |  33 ++
 m4/ac_prog_jar.m4 |  39 +
 m4/ac_prog_java.m4|  83 +++
 m4/ac_prog_java_works.m4  |  98 +
 m4/ac_prog_javac.m4   |  45 +++
 m4/ac_prog_javac_works.m4 |  36 
 m4/ac_prog_javah.m4   |  28 
 m4/ac_try_compile_java.m4 |  40 +
 m4/ac_try_run_javac.m4|  41 ++
 14 files changed, 615 insertions(+), 17 deletions(-)
 create mode 100644 m4/ac_check_class.m4
 create mode 100644 m4/ac_check_classpath.m4
 create mode 100644 m4/ac_check_rqrd_class.m4
 create mode 100644 m4/ac_java_options.m4
 create mode 100644 m4/ac_prog_jar.m4
 create mode 100644 m4/ac_prog_java.m4
 create mode 100644 m4/ac_prog_java_works.m4
 create mode 100644 m4/ac_prog_javac.m4
 create mode 100644 m4/ac_prog_javac_works.m4
 create mode 100644 m4/ac_prog_javah.m4
 create mode 100644 m4/ac_try_compile_java.m4
 create mode 100644 m4/ac_try_run_javac.m4

diff --git a/autogen.sh b/autogen.sh
index 08e435b..9d6a77b 100755
--- a/autogen.sh
+++ b/autogen.sh
@@ -12,7 +12,7 @@ check_for_pkg_config() {
 }
 
 rm -f config.cache
-aclocal #-I m4
+aclocal -I m4 --install
 check_for_pkg_config
 libtoolize --force --copy
 autoconf
diff --git a/configure.ac b/configure.ac
index 832054b..32814b8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -271,9 +271,6 @@ AM_CONDITIONAL(ENABLE_CEPHFS_JAVA, test 
"x$enable_cephfs_java" = "xyes")
 AC_ARG_WITH(jdk-dir,
 AC_HELP_STRING([--with-jdk-dir(=DIR)], [Path to JDK directory]))
 
-AC_DEFUN([JAVA_DNE],
-   AC_MSG_ERROR([Cannot find $1 '$2'. Try setting --with-jdk-dir]))
-
 AS_IF([test "x$enable_cephfs_java" = "xyes"], [
 
# setup bin/include dirs from --with-jdk-dir (search for jni.h, javac)
@@ -314,20 +311,20 @@ AS_IF([test "x$enable_cephfs_java" = "xyes"], [
   AC_MSG_NOTICE([Cannot find junit4.jar (apt-get install junit4)])
   [have_junit4=0]])])
 
-   # Check for Java programs: javac, javah, jar
-PATH_save=$PATH
-   PATH="$PATH:$EXTRA_JDK_BIN_DIR"
-   AC_PATH_PROG(JAVAC, javac)
-AC_PATH_PROG(JAVAH, javah)
-AC_PATH_PROG(JAR, jar)
-PATH=$PATH_save
+  AC_CHECK_CLASSPATH
+  AC_PROG_JAVAC
+  AC_PROG_JAVAH
+  AC_PROG_JAR
 
-# Ensure we have them...
-AS_IF([test -z "$JAVAC"], JAVA_DNE(program, javac))
-AS_IF([test -z "$JAVAH"], JAVA_DNE(program, javah))
-AS_IF([test -z "$JAR"], JAVA_DNE(program, jar))
+  CLASSPATH=$CLASSPATH:$EXTRA_CLASSPATH_JAR
+  export CLASSPATH
+  AC_MSG_NOTICE([classpath - $CLASSPATH])
+  AS_IF([test "$have_junit4" = "1"], [
+   AC_CHECK_CLASS([org.junit.rules.ExternalResource], [], [
+   AC_MSG_NOTICE(Could not find org.junit.rules.ExternalResource)
+   have_junit4=0])])
 
-# Check for jni.h
+# Check for jni.h
CPPFLAGS_save=$CPPFLAGS
 
AS_IF([test -n "$EXTRA_JDK_INC_DIR"],
@@ -336,7 +333,7 @@ AS_IF([test "x$enable_cephfs_java" = "xyes"], [
 [JDK_CPPFLAGS="$JDK_CPPFLAGS 
-I$EXTRA_JDK_INC_DIR/linux"])
   CPPFLAGS="$CPPFLAGS $JDK_CPPFLAGS"])
 
-   AC_CHECK_HEADER([jni.h], [], JAVA_DNE(header, jni.h))
+   AC_CHECK_HEADER([jni.h], [], AC_MSG_ERROR([Cannot find header 'jni.h'. 
Try setting --with-jdk-dir]))
 
CPPFLAGS=$CPPFLAGS_save
 
diff --git a/m4/ac_check_class.m4 b/m4/ac_check_class.m4
new file mode 100644
index 000..17932c5
--- /dev/null
+++ b/m4/ac_check_class.m4
@@ -0,0 +1,108 @@
+dnl @synopsis AC_CHECK_CLASS
+dnl
+dnl AC_CHECK_CLASS tests the existence of a given Java class, either in
+dnl a jar or in a '.class' file.
+dnl
+dnl *Warning*: its success or failure can depend on a proper setting of
+dnl the CLASSPATH env. variable.
+dnl
+dnl Note: This is part of the set of autoconf M4 macros for Java
+dnl programs. It is VERY IMPORTANT that you download the whole set,
+dnl some macros depend on other. Unfortunately, the autoconf archive
+dnl does not support the concept of set of macros, so I had to break it
+dnl for submission. The general documentation, as well as the sample
+dnl configure.in, is included in the AC_PROG_JAVA macro.
+dnl
+dnl @category Java
+dnl @author Stephane Bortzmeyer 
+dnl @version 2000-07-19
+dnl @license GPLWithACException
+
+AC_DEFUN([AC_CHECK_CLASS],[
+AC_REQUIRE([AC_PR

[PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-09 Thread Danny Al-Gaaf
The attached patch depends on the set of 6 patches I send some days ago.
See: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/11793

Danny Al-Gaaf (1):
  configure.ac: check for org.junit.rules.ExternalResource

 autogen.sh|   2 +-
 configure.ac  |  29 ++---
 m4/ac_check_class.m4  | 108 ++
 m4/ac_check_classpath.m4  |  24 +++
 m4/ac_check_rqrd_class.m4 |  26 +++
 m4/ac_java_options.m4 |  33 ++
 m4/ac_prog_jar.m4 |  39 +
 m4/ac_prog_java.m4|  83 +++
 m4/ac_prog_java_works.m4  |  98 +
 m4/ac_prog_javac.m4   |  45 +++
 m4/ac_prog_javac_works.m4 |  36 
 m4/ac_prog_javah.m4   |  28 
 m4/ac_try_compile_java.m4 |  40 +
 m4/ac_try_run_javac.m4|  41 ++
 14 files changed, 615 insertions(+), 17 deletions(-)
 create mode 100644 m4/ac_check_class.m4
 create mode 100644 m4/ac_check_classpath.m4
 create mode 100644 m4/ac_check_rqrd_class.m4
 create mode 100644 m4/ac_java_options.m4
 create mode 100644 m4/ac_prog_jar.m4
 create mode 100644 m4/ac_prog_java.m4
 create mode 100644 m4/ac_prog_java_works.m4
 create mode 100644 m4/ac_prog_javac.m4
 create mode 100644 m4/ac_prog_javac_works.m4
 create mode 100644 m4/ac_prog_javah.m4
 create mode 100644 m4/ac_try_compile_java.m4
 create mode 100644 m4/ac_try_run_javac.m4

-- 
1.8.1



Re: geo replication

2013-01-09 Thread Mark Kampe

Right now, your only option is synchronous replication, which
happens at the speed of the slowest OSD ... so unless your
WAN links are fast and fat, it comes at non-negligible
performance penalty.

We will soon be sending out a proposal for an asynchronous
replication mechanism with eventual consistency for the
RADOS Gateway ... but that is a somewhat simpler problem
(immutable objects, good change lists, and a WAN friendly
protocol).

Asynchronous RADOS replication is definitely on our list,
but more complex and farther out.

On 01/09/2013 01:19 PM, Gandalf Corvotempesta wrote:

This was probably asked before, but I'm unable to find any answer.
Is it possible to replicate a cluster geographically?

GlusterFS does this with rsync (which I think is called automatically on
every file write); does Ceph do something similar?

I don't think that using multiple geographically distributed OSDs with
10-15 ms of latency would work well.



[PATCH] osd/ReplicatedPG.cc: fix errors in _scrub()

2013-01-09 Thread Danny Al-Gaaf
Fix build error introduced with 5b12b514b047a8a46cc5549bd94b398289b9b5f6:

osd/ReplicatedPG.cc: In member function 'virtual void 
ReplicatedPG::_scrub(ScrubMap&)':
osd/ReplicatedPG.cc:7116:4: error: 'errors' was not declared in this scope

Increment scrubber.errors instead of errors.

Signed-off-by: Danny Al-Gaaf 
---
 src/osd/ReplicatedPG.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index e8a68fe..1645041 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -7113,7 +7113,7 @@ void ReplicatedPG::_scrub(ScrubMap& scrubmap)
   if (head == hobject_t()) {
osd->clog.error() << mode << " " << info.pgid << " " << soid
  << " found clone without head";
-   ++errors;
+   ++scrubber.errors;
continue;
   }
 
-- 
1.8.1



Re: Windows port

2013-01-09 Thread Matt W. Benjamin
Hi,

Along the same lines, (p)NFS access from Windows clients should already be 
possible, for some definition of possible.  We'll make it actually possible 
over the next few months.  

Matt

- "Sage Weil"  wrote:

> On Wed, 9 Jan 2013, Florian Haas wrote:
> > On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey 
> wrote:
> > > Hi,
> > >
> > > I am also curious if a Windows port, specifically the client-side,
> is
> > > on the roadmap.
> > 
> > This is somewhat OT from the original post, but if all you're
> > interested is using RBD block storage from Windows, you can already
> do
> > that by going through an iSCSI or FC head node. Proof-of-concept
> > configuration outlined here:
> > 
> >
> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
> > 
> > Not sure if this helps, but just thought I'd mention it.
> 
> There is also a patch for Samba that glues libcephfs into Samba's VFS
> 
> layer.  This will let you reexport CephFS via CIFS.  These patches are
> 
> currently living at
> 
>   https://github.com/ceph/samba/commits/ceph-v3-6-test
> 
> If anybody is interested in playing with these, have at it!  Inktank 
> doesn't have resources to focus on it right now.
> 
> sage

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309


Re: OSD memory leaks?

2013-01-09 Thread Sébastien Han
Hi,

Thanks for the input.

I also see tons of "socket closed" messages; I recall that this message
is harmless. Anyway, cephx has been disabled on my platform from the
beginning... Anyone to confirm or refute my "scrub theory"?
--
Regards,
Sébastien Han.


On Wed, Jan 9, 2013 at 7:09 PM, Sylvain Munaut
 wrote:
> Just fyi, I also have growing memory on OSD, and I have the same logs:
>
> "libceph: osd4 172.20.11.32:6801 socket closed" in the RBD clients
>
>
> I traced that problem and correlated it to some cephx issue in the OSD
> some time ago in this thread
>
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg10634.html
>
> but the thread kind of died without a solution ...
>
> Cheers,
>
>Sylvain


Re: OSD memory leaks?

2013-01-09 Thread Sylvain Munaut
Just fyi, I also have growing memory on OSD, and I have the same logs:

"libceph: osd4 172.20.11.32:6801 socket closed" in the RBD clients


I traced that problem and correlated it to some cephx issue in the OSD
some time ago in this thread

http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg10634.html

but the thread kind of died without a solution ...

Cheers,

   Sylvain


Re: Windows port

2013-01-09 Thread Sage Weil
On Wed, 9 Jan 2013, Florian Haas wrote:
> On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey  wrote:
> > Hi,
> >
> > I am also curious if a Windows port, specifically the client-side, is
> > on the roadmap.
> 
> This is somewhat OT from the original post, but if all you're
> interested is using RBD block storage from Windows, you can already do
> that by going through an iSCSI or FC head node. Proof-of-concept
> configuration outlined here:
> 
> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
> 
> Not sure if this helps, but just thought I'd mention it.

There is also a patch for Samba that glues libcephfs into Samba's VFS 
layer.  This will let you reexport CephFS via CIFS.  These patches are 
currently living at

https://github.com/ceph/samba/commits/ceph-v3-6-test

If anybody is interested in playing with these, have at it!  Inktank 
doesn't have resources to focus on it right now.

sage


Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Sage Weil
On Wed, 9 Jan 2013, Dennis Jacobfeuerborn wrote:
> On 01/09/2013 01:51 PM, Lachfeld, Jutta wrote:
> > Hi all,
> > 
> > in expectation of better performance, we are just switching from CEPH 
> > version 0.48 to 0.56.1
> > for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.
> > 
> > We are now wondering whether there are currently any development activities 
> > concerning further significant performance enhancements, 
> > or whether further significant performance enhancements are already planned 
> > for the near future.
> > 
> > I would now be loath to start benchmarking with 0.56.1 and then, a month or 
> > so later, detect that there have been significant performance enhancements 
> > in CEPH in the meantime.
> 
> There shouldn't be any major changes since v0.56.x is a stable release and
> as such should only receive bug/security fixes and non-risky improvements.
> Any changes that would result in a significant change in performance would
> probably be too disruptive for a stable release series.

That is generally true.  One exception is that there may be some simple 
changes that can decrease the impact of data migration on performance.  
There are some changes we made for a customer that seem to make a big 
difference and will be making it into the main tree (and hopefully 
bobtail, and possibly even argonaut) shortly.

sage


Re: OSD memory leaks?

2013-01-09 Thread Sébastien Han
If you wait too long, the system will trigger OOM killer :D, I already
experienced that unfortunately...

Sam?

On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano  wrote:
> OOM killer



--
Regards,
Sébastien Han.


Re: OSD memory leaks?

2013-01-09 Thread Dave Spano
Yes, I'm using argonaut. 

I've got 38 heap files from yesterday. Currently, the OSD in question is using 
91.2% of memory according to top, and staying there. I initially thought it 
would go until the OOM killer started killing processes, but I don't see 
anything funny in the system logs that indicate that. 

On the other hand, the ceph-osd process on osd.1 is using far less memory. 

osd.0
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
 9151 root     20   0 20.4g  14g 2548 S    1 91.2 517:58.71  ceph-osd

osd.1
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
10785 root     20   0  673m 310m 5164 S    3  1.9 107:04.39  ceph-osd

Here's what tcmalloc says when I run ceph osd tell 0 heap stats:
2013-01-09 11:09:36.778675 7f62aae23700  0 log [INF] : osd.0 tcmalloc heap stats:
2013-01-09 11:09:36.779113 7f62aae23700  0 log [INF] : MALLOC:  210884768 (  201.1 MB) Bytes in use by application
2013-01-09 11:09:36.779348 7f62aae23700  0 log [INF] : MALLOC: + 89026560 (   84.9 MB) Bytes in page heap freelist
2013-01-09 11:09:36.779928 7f62aae23700  0 log [INF] : MALLOC: +  7926512 (    7.6 MB) Bytes in central cache freelist
2013-01-09 11:09:36.779951 7f62aae23700  0 log [INF] : MALLOC: +   144896 (    0.1 MB) Bytes in transfer cache freelist
2013-01-09 11:09:36.779972 7f62aae23700  0 log [INF] : MALLOC: + 11046512 (   10.5 MB) Bytes in thread cache freelists
2013-01-09 11:09:36.780013 7f62aae23700  0 log [INF] : MALLOC: +  5177344 (    4.9 MB) Bytes in malloc metadata
2013-01-09 11:09:36.780030 7f62aae23700  0 log [INF] : MALLOC: ------------
2013-01-09 11:09:36.780056 7f62aae23700  0 log [INF] : MALLOC: =324206592 (  309.2 MB) Actual memory used (physical + swap)
2013-01-09 11:09:36.780081 7f62aae23700  0 log [INF] : MALLOC: +126177280 (  120.3 MB) Bytes released to OS (aka unmapped)
2013-01-09 11:09:36.780112 7f62aae23700  0 log [INF] : MALLOC: ------------
2013-01-09 11:09:36.780127 7f62aae23700  0 log [INF] : MALLOC: =450383872 (  429.5 MB) Virtual address space used
2013-01-09 11:09:36.780152 7f62aae23700  0 log [INF] : MALLOC:
2013-01-09 11:09:36.780168 7f62aae23700  0 log [INF] : MALLOC:  37492  Spans in use
2013-01-09 11:09:36.780330 7f62aae23700  0 log [INF] : MALLOC:     51  Thread heaps in use
2013-01-09 11:09:36.780359 7f62aae23700  0 log [INF] : MALLOC:   4096  Tcmalloc page size
2013-01-09 11:09:36.780384 7f62aae23700  0 log [INF] : 



Dave Spano 
Optogenics 
Systems Administrator 



- Original Message - 

From: "Sébastien Han"  
To: "Samuel Just"  
Cc: "Dave Spano" , "ceph-devel" 
 
Sent: Wednesday, January 9, 2013 10:20:43 AM 
Subject: Re: OSD memory leaks? 

I guess he runs Argonaut as well. 

More suggestions about this problem? 

Thanks! 

-- 
Regards, 
Sébastien Han. 


On Mon, Jan 7, 2013 at 8:09 PM, Samuel Just  wrote: 
> 
> Awesome! What version are you running (ceph-osd -v, include the hash)? 
> -Sam 
> 
> On Mon, Jan 7, 2013 at 11:03 AM, Dave Spano  wrote: 
> > This failed the first time I sent it, so I'm resending in plain text. 
> > 
> > Dave Spano 
> > Optogenics 
> > Systems Administrator 
> > 
> > 
> > 
> > - Original Message - 
> > 
> > From: "Dave Spano"  
> > To: "Sébastien Han"  
> > Cc: "ceph-devel" , "Samuel Just" 
> >  
> > Sent: Monday, January 7, 2013 12:40:06 PM 
> > Subject: Re: OSD memory leaks? 
> > 
> > 
> > Sam, 
> > 
> > Attached are some heaps that I collected today. 001 and 003 are just after 
> > I started the profiler; 011 is the most recent. If you need more, or 
> > anything different let me know. Already the OSD in question is at 38% 
> > memory usage. As mentioned by Sèbastien, restarting ceph-osd keeps things 
> > going. 
> > 
> > Not sure if this is helpful information, but out of the two OSDs that I 
> > have running, the first one (osd.0) is the one that develops this problem 
> > the quickest. osd.1 does have the same issue, it just takes much longer. Do 
> > the monitors hit the first osd in the list first, when there's activity? 
> > 
> > 
> > Dave Spano 
> > Optogenics 
> > Systems Administrator 
> > 
> > 
> > - Original Message - 
> > 
> > From: "Sébastien Han"  
> > To: "Samuel Just"  
> > Cc: "ceph-devel"  
> > Sent: Friday, January 4, 2013 10:20:58 AM 
> > Subject: Re: OSD memory leaks? 
> > 
> > Hi Sam, 
> > 
> > Thanks for your answer and sorry the late reply. 
> > 
> > Unfortunately I can't get something out from the profiler, actually I 
> > do but I guess it doesn't show what is supposed to show... I will keep

Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

2013-01-09 Thread Noah Watkins
Hi Jutta,

On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta
 wrote:
>
> the current content of the web page http://ceph.com/docs/master/cephfs/hadoop 
> shows a configuration parameter ceph.object.size.
> Is it the CEPH equivalent  to the "HDFS block size" parameter which I have 
> been looking for?

Yes. By specifying ceph.object.size, Hadoop will use a default
Ceph file layout with stripe unit = object size and stripe count = 1.
This is effectively the same as dfs.block.size for HDFS.
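To make that concrete, here is a hedged sketch of what such a setting could look like. The property name comes from the cephfs/hadoop docs mentioned above; the fragment file name and the 256 MB value are illustrative assumptions, and Hadoop expects the value in bytes, as it does for dfs.block.size.

```shell
# Hypothetical Hadoop configuration fragment: set a 256 MB Ceph object size.
# 268435456 bytes = 256 * 1024 * 1024. Merge this into your core-site.xml.
cat > core-site.xml.fragment <<'EOF'
<property>
  <name>ceph.object.size</name>
  <value>268435456</value>
</property>
EOF
```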

> Does the parameter ceph.object.size apply to version 0.56.1?

The Ceph/Hadoop file system plugin is being developed here:

  git://github.com/ceph/hadoop-common cephfs/branch-1.0

There is an old version of the Hadoop plugin in the Ceph tree which
will be removed shortly. Regarding the versions, development is taking
place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a
system in place for dealing with compatibility across versions because
the code is in heavy development.

If you are running 0.56.1 then a recent version of cephfs/branch-1.0
should work with that, but may not for long, as development continues.

> I would be interested in setting this parameter to values higher than 64MB, 
> e.g. 256MB or 512MB similar to the values I have used for HDFS for increasing 
> the performance of the TeraSort benchmark. Would these values be allowed and 
> would they at all make sense for the mechanisms used in CEPH?

I can't think of any reason why a large size would cause concern, but
maybe someone else can chime in?

- Noah


Re: OSD memory leaks?

2013-01-09 Thread Sébastien Han
I guess he runs Argonaut as well.

More suggestions about this problem?

Thanks!

--
Regards,
Sébastien Han.


On Mon, Jan 7, 2013 at 8:09 PM, Samuel Just  wrote:
>
> Awesome!  What version are you running (ceph-osd -v, include the hash)?
> -Sam
>
> On Mon, Jan 7, 2013 at 11:03 AM, Dave Spano  wrote:
> > This failed the first time I sent it, so I'm resending in plain text.
> >
> > Dave Spano
> > Optogenics
> > Systems Administrator
> >
> >
> >
> > - Original Message -
> >
> > From: "Dave Spano" 
> > To: "Sébastien Han" 
> > Cc: "ceph-devel" , "Samuel Just" 
> > 
> > Sent: Monday, January 7, 2013 12:40:06 PM
> > Subject: Re: OSD memory leaks?
> >
> >
> > Sam,
> >
> > Attached are some heaps that I collected today. 001 and 003 are just after 
> > I started the profiler; 011 is the most recent. If you need more, or 
> > anything different let me know. Already the OSD in question is at 38% 
> > memory usage. As mentioned by Sèbastien, restarting ceph-osd keeps things 
> > going.
> >
> > Not sure if this is helpful information, but out of the two OSDs that I 
> > have running, the first one (osd.0) is the one that develops this problem 
> > the quickest. osd.1 does have the same issue, it just takes much longer. Do 
> > the monitors hit the first osd in the list first, when there's activity?
> >
> >
> > Dave Spano
> > Optogenics
> > Systems Administrator
> >
> >
> > - Original Message -
> >
> > From: "Sébastien Han" 
> > To: "Samuel Just" 
> > Cc: "ceph-devel" 
> > Sent: Friday, January 4, 2013 10:20:58 AM
> > Subject: Re: OSD memory leaks?
> >
> > Hi Sam,
> >
> > Thanks for your answer and sorry the late reply.
> >
> > Unfortunately I can't get something out from the profiler, actually I
> > do but I guess it doesn't show what is supposed to show... I will keep
> > on trying this. Anyway yesterday I just thought that the problem might
> > be due to some over usage of some OSDs. I was thinking that the
> > distribution of the primary OSD might be uneven, this could have
> > explained that some memory leaks are more important with some servers.
> > At the end, the repartition seems even but while looking at the pg
> > dump I found something interesting in the scrub column, timestamps
> > from the last scrubbing operation matched with times showed on the
> > graph.
> >
> > After this, I made some calculation, I compared the total number of
> > scrubbing operation with the time range where memory leaks occurred.
> > First of all check my setup:
> >
> > root@c2-ceph-01 ~ # ceph osd tree
> > dumped osdmap tree epoch 859
> > # id weight type name up/down reweight
> > -1 12 pool default
> > -3 12 rack lc2_rack33
> > -2 3 host c2-ceph-01
> > 0 1 osd.0 up 1
> > 1 1 osd.1 up 1
> > 2 1 osd.2 up 1
> > -4 3 host c2-ceph-04
> > 10 1 osd.10 up 1
> > 11 1 osd.11 up 1
> > 9 1 osd.9 up 1
> > -5 3 host c2-ceph-02
> > 3 1 osd.3 up 1
> > 4 1 osd.4 up 1
> > 5 1 osd.5 up 1
> > -6 3 host c2-ceph-03
> > 6 1 osd.6 up 1
> > 7 1 osd.7 up 1
> > 8 1 osd.8 up 1
> >
> >
> > And there are the results:
> >
> > * Ceph node 1 which has the most important memory leak performed 1608
> > in total and 1059 during the time range where memory leaks occured
> > * Ceph node 2, 1168 in total and 776 during the time range where
> > memory leaks occured
> > * Ceph node 3, 940 in total and 94 during the time range where memory
> > leaks occurred
> > * Ceph node 4, 899 in total and 191 during the time range where
> > memory leaks occurred
> >
> > I'm still not entirely sure that the scrub operation causes the leak
> > but the only relevant relation that I found...
> >
> > Could it be that the scrubbing process doesn't release memory? Btw I
> > was wondering, how ceph decides at what time it should run the
> > scrubbing operation? I know that it's once a day and control by the
> > following options
> >
> > OPTION(osd_scrub_min_interval, OPT_FLOAT, 300)
> > OPTION(osd_scrub_max_interval, OPT_FLOAT, 60*60*24)
> >
> > But how ceph determined the time where the operation started, during
> > cluster creation probably?
> >
> > I just checked the options that control OSD scrubbing and found that by 
> > default:
> >
> > OPTION(osd_max_scrubs, OPT_INT, 1)
> >
> > So that might explain why only one OSD uses a lot of memory.
> >
> > My dirty workaround at the moment is to performed a check of memory
> > use by every OSD and restart it if it uses more than 25% of the total
> > memory. Also note that on ceph 1, 3 and 4 it's always one OSD that
> > uses a lot of memory, for ceph 2 only the mem usage is high but almost
> > the same for all the OSD process.
> >
> > Thank you in advance.
> >
> > --
> > Regards,
> > Sébastien Han.
> >
> >
> > On Wed, Dec 19, 2012 at 10:43 PM, Samuel Just  wrote:
> >>
> >> Sorry, it's been very busy. The next step would to try to get a heap
> >> dump. You can start a heap profile on osd N by:
> >>
> >> ceph osd tell N heap start_profiler
> >>
> >> and you can get it to dump the collected profile using
> >>
> >> ceph osd tell N h
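The "restart an OSD when it uses more than 25% of memory" workaround described in the quoted thread could be sketched as a cron-able shell check. The threshold, the process matching, and the commented-out restart command are assumptions; adapt them to your init system and map the pid to the right osd id before restarting.

```shell
#!/bin/sh
# Flag (and optionally restart) any ceph-osd whose resident memory exceeds
# a percentage threshold of total memory.
THRESHOLD=25

# Succeeds when the first argument (%MEM from ps) exceeds the second (threshold).
should_restart() {
  awk -v m="$1" -v t="$2" 'BEGIN { exit !(m + 0 > t + 0) }'
}

# -C and pmem are procps (Linux) options; no output if no ceph-osd is running.
ps -C ceph-osd -o pid=,pmem= | while read -r pid mem; do
  if should_restart "$mem" "$THRESHOLD"; then
    echo "ceph-osd pid $pid at ${mem}% memory -- would restart"
    # service ceph restart osd.N   # map pid -> osd id first
  fi
done
```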

RE: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

2013-01-09 Thread Lachfeld, Jutta
Hi Noah,

the current content of the web page http://ceph.com/docs/master/cephfs/hadoop 
shows a configuration parameter ceph.object.size.
Is it the CEPH equivalent  to the "HDFS block size" parameter which I have been 
looking for?

Does the parameter ceph.object.size apply to version 0.56.1?

I would be interested in setting this parameter to values higher than 64MB, 
e.g. 256MB or 512MB similar to the values I have used for HDFS for increasing 
the performance of the TeraSort benchmark. Would these values be allowed and 
would they at all make sense for the mechanisms used in CEPH?

Regards,
Jutta.

-
jutta.lachf...@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE 
SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company 
Details: http://de.ts.fujitsu.com/imprint

> -Original Message-
> From: Noah Watkins [mailto:jayh...@cs.ucsc.edu]
> Sent: Thursday, December 13, 2012 9:33 PM
> To: Gregory Farnum
> Cc: Cameron Bahar; Sage Weil; Lachfeld, Jutta; ceph-devel@vger.kernel.org; 
> Noah
> Watkins; Joe Buck
> Subject: Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark
> performance comparison issue
> 
> The bindings use the default Hadoop settings (e.g. 64 or 128 MB
> chunks) when creating new files. The chunk size can also be specified on a 
> per-file basis
> using the same interface as Hadoop. Additionally, while Hadoop doesn't 
> provide an
> interface to configuration parameters beyond chunk size, we will also let 
> users fully
> configure for any Ceph striping strategy. 
> http://ceph.com/docs/master/dev/file-striping/
> 
> -Noah
> 
> On Thu, Dec 13, 2012 at 12:27 PM, Gregory Farnum  wrote:
> > On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar  wrote:
> >> Is the chunk size tunable in A Ceph cluster. I don't mean dynamic, but 
> >> even statically
> configurable when a cluster is first installed?
> >
> > Yeah. You can set chunk size on a per-file basis; you just can't
> > change it once the file has any data written to it.
> > In the context of Hadoop the question is just if the bindings are
> > configured correctly to do so automatically.
> > -Greg
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majord...@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
N�r��yb�X��ǧv�^�)޺{.n�+���z�]z���{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

Re: Crushmap Design Question

2013-01-09 Thread Joao Eduardo Luis
On 01/09/2013 08:59 AM, Wido den Hollander wrote:
> Hi,
> 
> On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
>> Hi,
>>  Setting rep size to 3 only makes the data triple-replicated; that means
>> when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
>>  But the monitors are another story: a monitor cluster of 2N+1 nodes
>> requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
>>  It looks to me like this discipline makes it hard to design a deployment
>> that is robust against a DC outage. But I am hoping for input from the
>> community on how to make the monitor cluster reliable.
>>
> 
>  From what I understand he didn't kill the second mon, still leaving 2
> out of 3 mons running.

Indeed. A good hint that this is the case is this bit of Shawn's message:

>> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
>> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 
>> active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 
>> 16362/49086 degraded (33.333%)
>>
>> At this point everything is still ok.  But when I fail the 2nd datacenter 
>> (still leaving 2 out of 3 mons running) I get:
>> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 
>> incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail

If you still manage to get these messages, it means your monitors are
still handling and answering requests, and that only happens when you
have a quorum :)

  -Joao


RE: Crushmap Design Question

2013-01-09 Thread Moore, Shawn M
Correct, it never went below N+1 (3 total mons and 2 of them never went down).

Several times in the past I verified with that command that a PG was actually 
mapped to valid DCs.  I just wrote a quick script that does this on the fly, 
and after recovering the cluster last night, every PG's OSD mapping includes an 
OSD in each DC.  I will fail the cluster again later today and see what it 
looks like after one DC fails, and then again after the second fails.
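A "quick script" of that kind could look like the sketch below. The awk column layout is an assumption about `ceph pg dump` plain output (pgid first, acting set last); verify it against your Ceph version before relying on it.

```shell
#!/bin/sh
# Flag any PG whose acting set lists fewer OSDs than the pool's replica count.
flag_degraded_pgs() {
  awk -v size=3 '{
    acting = $NF                         # acting set, e.g. [0,5,9]
    gsub(/[][]/, "", acting)             # strip the brackets
    n = split(acting, osds, ",")         # count OSDs in the acting set
    if (n < size) print $1 " has only " n " acting OSDs: " acting
  }'
}

# Real input would be:  ceph pg dump | tail -n +2 | flag_degraded_pgs
printf '0.1a x x [0,5,9]\n0.2b x x [3,7]\n' | flag_degraded_pgs
# -> 0.2b has only 2 acting OSDs: 3,7
```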

As far as the weighting goes, I'm not sure how I ended up this way.  So should 
I change the "adm" tree:
FROM
-25 8   datacenter adm
-16 8   host admdisk0
TO
-25 36  datacenter adm
-16 1   host admdisk0
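
For what it's worth, a back-of-the-envelope look at those weights (proportional arithmetic only, not a CRUSH simulation): if datacenters are picked roughly in proportion to their bucket weights, the current 36/36/8 tree gives adm far less than a third of placements:

```shell
# Rough share of placements per datacenter, assuming selection
# proportional to bucket weight (a simplification of how CRUSH
# straw buckets behave).
share_pct() {
  # share_pct <weight> <total_weight> -> integer percent
  echo $(( 100 * $1 / $2 ))
}

total=$(( 36 + 36 + 8 ))
share_pct 36 "$total"   # hok: prints 45
share_pct 36 "$total"   # csc: prints 45
share_pct 8  "$total"   # adm: prints 10
```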

Regards


-Original Message-
From: Wido den Hollander [mailto:w...@widodh.nl] 
Sent: Wednesday, January 09, 2013 4:00 AM
To: Chen, Xiaoxi
Cc: Moore, Shawn M; ceph-devel@vger.kernel.org
Subject: Re: Crushmap Design Question

Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
>   Setting rep size to 3 only makes the data triple-replicated; that means 
> when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
>   But the monitors are another story: a monitor cluster with 2N+1 nodes 
> requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
>   It looks to me like this discipline makes it hard to design a proper 
> deployment that is robust to a DC outage. But hoping for inputs from the 
> community on how to make the monitor cluster reliable.
> 

From what I understand he didn't kill the second mon, still leaving 2
out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over
the 3 DCs?

"ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried this before, but you don't have equal weights for the
datacenters, so I don't know how that affects the situation.

Wido

>   
>   
>   Xiaoxi
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Moore, Shawn M
> Sent: 2013年1月9日 4:21
> To: ceph-devel@vger.kernel.org
> Subject: Crushmap Design Question
> 
> I have been testing ceph for a little over a month now.  Our design goal is 
> to have 3 datacenters in different buildings all tied together over 10GbE.  
> Currently there are 10 servers each serving 1 osd in 2 of the datacenters.  
> In the third is one large server with 16 SAS disks serving 8 osds.  
> Eventually we will add one more identical large server into the third 
> datacenter.  I have told ceph to keep 3 copies and tried to do the crushmap 
> in such a way that as long as a majority of mon's can stay up, we could run 
> off of one datacenter's worth of osds.   So in my testing, it doesn't work 
> out quite this way...
> 
> Everything is currently ceph version 0.56.1 
> (e4a541624df62ef353e754391cbbb707f54b16f7)
> 
> I will put hopefully relevant files at the end of this email.
> 
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 
> active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
> 
> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 
> active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 
> 16362/49086 degraded (33.333%)
> 
> At this point everything is still ok.  But when I fail the 2nd datacenter 
> (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 
> incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
> 
> Most VMs quit working; "rbd ls" works, but not a single line from "rados 
> -p rbd ls" comes back and the command hangs.  Now after a while (as you can 
> see from the timestamps) I end up in the following state, and it stays this way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 
> 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 
> remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 
> 7696/49086 degraded (15.679%)

> 
> I'm hoping I've done something wrong, so please advise.  Below are my 
> configs.  If you need something more to help, just ask.
> 
> Normal output with all datacenters up.
> # ceph osd tree
> # id  weight  type name   up/down reweight
> -180  root default
> -336  datacenter hok
> -21   host blade151
> 0 1   osd.0   up  1   
> -41   host blade152
> 1 1   osd.1   up  1   
> -15   1   host blade153
> 2 1

Re: Windows port

2013-01-09 Thread Florian Haas
On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey  wrote:
> Hi,
>
> I am also curious if a Windows port, specifically the client-side, is
> on the roadmap.

This is somewhat OT from the original post, but if all you're
interested in is using RBD block storage from Windows, you can already do
that by going through an iSCSI or FC head node. Proof-of-concept
configuration outlined here:

http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices

Not sure if this helps, but just thought I'd mention it.

Cheers,
Florian

-- 
Helpful information? Let us know!
http://www.hastexo.com/shoutbox


Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Mark Nelson

On 01/09/2013 06:51 AM, Lachfeld, Jutta wrote:

Hi all,

in expectation of better performance, we are just switching from CEPH version 
0.48 to 0.56.1
for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.

We are now wondering whether there are currently any development activities
concerning further significant performance enhancements,
or whether further significant performance enhancements are already planned for 
the near future.

I would now be loath to start benchmarking with 0.56.1 and then, a month or so 
later, detect that there have been significant performance enhancements in CEPH 
in the meantime.

Regards,
Jutta.

-
jutta.lachf...@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, 
"Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: 
http://de.ts.fujitsu.com/imprint



Hi Jutta,

As Wido mentioned there have been some performance improvements, 
especially with small IO sizes.  The conclusion section of the 
performance preview may be useful for you:


http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/

One oddity is that there may have been some regression for 128k reads. 
Overall though I'd say that performance has improved, especially on XFS.


I don't think it's likely we will be pushing any performance patches to 
the bobtail series, but it's possible performance could change as a 
result of a bug fix.


For what it's worth, I've started performing sweeps over ceph parameter 
spaces (and looking at underlying io schedulers) to see how tuning 
affects ceph performance under different scenarios.  I'm hoping to be 
able to release the results later this month.


Mark


Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Mark Kampe

Performance work is always ongoing, but I am not aware of any
significant imminent enhancements.  We are just wrapping up an
investigation of the effects of various file system and I/O
options on different types of traffic, and the next major area
of focus will be RADOS Block Device and VMs over RBD.  This is
pretty far away from Hadoop and probably won't yield much fruit
until March.

There are a few people working on Hadoop integration, and I
have not been closely following their activities, but I do
not believe that any major performance work will be forthcoming
in the next few weeks.

On 01/09/2013 04:51 AM, Lachfeld, Jutta wrote:

Hi all,

in expectation of better performance, we are just switching from CEPH version 
0.48 to 0.56.1
for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.

We are now wondering whether there are currently any development activities
concerning further significant performance enhancements,
or whether further significant performance enhancements are already planned for 
the near future.

I would now be loath to start benchmarking with 0.56.1 and then, a month or so 
later, detect that there have been significant performance enhancements in CEPH 
in the meantime.



Re: OSD's slow down to a crawl

2013-01-09 Thread Mark Nelson

On 01/09/2013 02:52 AM, Matthew Anderson wrote:

Hi Sage,

Sorry for the late follow up, I've been on a bit of a testing rampage and 
managed to somewhat sort the problem.

Most of the problem appears to be from the 3.7.1 kernel. It seems to have a fairly 
big issue with its networking stack that was causing Ceph's network operations to 
hang. Moving back to a 3.6.8 kernel fixed this up. I don't know the full extent of 
the problem but it was reported on Phoronix briefly here - 
http://www.phoronix.com/scan.php?page=news_item&px=MTI2Nzc

The second issue was BTRFS on both the 3.7 and 3.6.8 kernels. After running a 
long rados bench (10 minutes) on a fresh cluster it would often slow down 
significantly by going from 250MB/s down to a 50MB/s average. Latency also 
increased dramatically. Restarting the OSD's fixes the issue but after a while 
it slows right down again. In the end I re-formatted the cluster using XFS (and 
also EXT4 for benchmarks) and there wasn't a single issue. I had rados bench 
running for over 30 minutes from another machine and there wasn't a single 
issue.


Ah, too bad this is still happening. :(  It's interesting though that 
restarting the OSDs fixes it.  That's not something I expected.  Sounds 
like I need to run some more tests again and see if I can get to the 
bottom of it.




At this stage I need to start moving into production with XFS. My test cluster 
arrives in a few weeks, so I should be able to come back to the BTRFS issue 
later on, as it would be very handy to have compression working.

Thanks again for your help
-Matt


-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Saturday, 22 December 2012 12:02 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

On Fri, 21 Dec 2012, Matthew Anderson wrote:

Hi Sage,

I've tried to reproduce the error again with logging on every OSD and
got the above. RADOS bench had stalled on a write request like the
last time and the attached log is the grep'd OSD log (# cat osd.25.log
| grep client.9501.0:744>  freeze.log) . The OSD that stalled was 25,
pg map is below -

# ceph pg map 6.5d83495b
osdmap e3775 pg 6.5d83495b (6.95b) ->  up [25,31] acting [25,31]

I hope that's what you were after, if not just let me know


We're getting closer.  The osd tried to send the reply.  Can you reproduce with 
'debug ms = 20' on the osds too, and on the client side do something like

  rados --debug-ms 20 --debug-objecter 20 --log-file /tmp/foo ...

Thanks!
sage




Thanks again
-Matt


-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Friday, 21 December 2012 1:14 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

On Thu, 20 Dec 2012, Matthew Anderson wrote:

Hi Sage,

Logs are attached. I took the osd logs from osd.24 as this is the
first osd in my SSD pool I've been testing with previously.

The 4MB bench I was able to reproduce the fault by restarting my rbd
export which stalled after a few percent complete. When I ran the
4MB bench it stalled early on and never received a response back
from the OSD and I terminated it after 60 seconds or so. I wasn't
able to reproduce the fault using the 4kb io size. The 4kb log
should show rados bench completing normally at a respectable speed of about 
1MB/s.


Let's drill into the hang.. up until that point things look okay.

2012-12-21 00:51:26.033622 7f6f3c042760  1 -- 172.16.0.13:0/1023886
-->  172.16.0.13:6813/22233 -- osd_op(client.9503.0:185
benchmark_data_KVM04_23886_object184 [write 0~4194304] 6.3ca4346e) v4
-- ?+0 0x171ea50 con 0x171a7e0

Do you have a log for that OSD so we can see what happened there?  It
may also be that the replicated write is hung.  If you do

  ceph pg map 6.3ca4346e

you can see all OSDs storing that PG.  And/or you can grep for
client.9503.0:185 in 172.16.0.13:6813/22233's log and see whether the sub_op 
was sent.
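
To illustrate the kind of trace Sage suggests: grep the OSD log for the client op id and count the hits. The log lines in the sketch below are fabricated to match the general shape of OSD debug output; only the op id and the grep pattern matter.

```shell
# Hypothetical sketch: trace one client op through an OSD log by its id.
# These lines are fabricated examples, not output from a real cluster.
log=$(mktemp)
cat > "$log" <<'EOF'
... <== osd_op(client.9503.0:185 benchmark_data [write 0~4194304]) ...
... ==> osd_sub_op(client.9503.0:185 ...) to osd.31 ...
... unrelated line for another op client.9503.0:186 ...
EOF

# Two hits (the op plus its sub_op) suggest the replicated write was sent.
hits=$(grep -c 'client\.9503\.0:185 ' "$log")
echo "$hits"   # prints 2
rm -f "$log"
```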

Thanks!
sage




Thanks
-Matt

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Friday, 21 December 2012 12:30 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

Can you do a similar test, but with full logging on?

  ceph tell osd.0 injectargs '--debug-ms 1 --debug-filestore 20
--debug-osd
20 --debug-journal 20'
  rados -p ssd bench 30 write -b 4096 -t 1 --log-file /tmp/foo
--debug-ms 1

That will be a single IO in flight at a time and very easy to trace through the 
logs.  If you can post the resulting log file (/tmp/foo and from osd.0), that 
would be awesome.

Thanks!
sage



On Thu, 20 Dec 2012, Matthew Anderson wrote:


# rados bench 60 write -t 256 -p ssd  Maintaining 256 concurrent
writes of 4194304 bytes for at least 60 seconds.
  Object prefix: benchmark_data_KVM03_12985
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg 

Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Dennis Jacobfeuerborn
On 01/09/2013 01:51 PM, Lachfeld, Jutta wrote:
> Hi all,
> 
> in expectation of better performance, we are just switching from CEPH version 
> 0.48 to 0.56.1
> for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.
> 
> We are now wondering whether there are currently any development activities 
> concerning further significant performance enhancements, 
> or whether further significant performance enhancements are already planned 
> for the near future.
> 
> I would now be loath to start benchmarking with 0.56.1 and then, a month or 
> so later, detect that there have been significant performance enhancements in 
> CEPH in the meantime.

There shouldn't be any major changes, since v0.56.x is a stable release and
as such should only receive bug/security fixes and non-risky improvements.
Any changes that would result in a significant change in performance would
probably be too disruptive for a stable release series.

Regards,
  Dennis


Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Christopher Kunz
Hi,
> 
> Yes, 0.56(.1) has a significant performance increase compared to 0.48
> 
That is not exactly the OP's question, though. If I understand
correctly, she is concerned about ongoing performance improvements
within the "bobtail" branch, i.e. between 0.56.1 and 0.56.X (with X>1).

Jutta, what kind of use case do you have in mind, i.e. how complex are
your benchmarking scenarios?

Regards,

--ck


Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Wido den Hollander

On 01/09/2013 01:51 PM, Lachfeld, Jutta wrote:

Hi all,

in expectation of better performance, we are just switching from CEPH version 
0.48 to 0.56.1
for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.

We are now wondering whether there are currently any development activities
concerning further significant performance enhancements,
or whether further significant performance enhancements are already planned for 
the near future.



Yes, 0.56(.1) has a significant performance increase compared to 0.48

Two blogposts which might be interesting to read:
* http://ceph.com/dev-notes/whats-new-in-the-land-of-osd/
* http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/

I'm not running with HDFS, but I see a good performance increase with 
Virtual Machines running on RBD.


Wido


I would now be loath to start benchmarking with 0.56.1 and then, a month or so 
later, detect that there have been significant performance enhancements in CEPH 
in the meantime.

Regards,
Jutta.

-
jutta.lachf...@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, 
"Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: 
http://de.ts.fujitsu.com/imprint





Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Lachfeld, Jutta
Hi all,

in expectation of better performance, we are just switching from CEPH version 
0.48 to 0.56.1
for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.

We are now wondering whether there are currently any development activities 
concerning further significant performance enhancements, 
or whether further significant performance enhancements are already planned for 
the near future.

I would now be loath to start benchmarking with 0.56.1 and then, a month or so 
later, detect that there have been significant performance enhancements in CEPH 
in the meantime.

Regards,
Jutta.

-
jutta.lachf...@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE 
SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company 
Details: http://de.ts.fujitsu.com/imprint


Re: Crushmap Design Question

2013-01-09 Thread Wido den Hollander
Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
>   Setting rep size to 3 only makes the data triple-replicated; that means 
> when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
>   But the monitors are another story: a monitor cluster with 2N+1 nodes 
> requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
>   It looks to me like this discipline makes it hard to design a proper 
> deployment that is robust to a DC outage. But hoping for inputs from the 
> community on how to make the monitor cluster reliable.
> 

From what I understand he didn't kill the second mon, still leaving 2
out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over
the 3 DCs?

"ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried this before, but you don't have equal weights for the
datacenters, so I don't know how that affects the situation.

Wido

>   
>   
>   Xiaoxi
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Moore, Shawn M
> Sent: 2013年1月9日 4:21
> To: ceph-devel@vger.kernel.org
> Subject: Crushmap Design Question
> 
> I have been testing ceph for a little over a month now.  Our design goal is 
> to have 3 datacenters in different buildings all tied together over 10GbE.  
> Currently there are 10 servers each serving 1 osd in 2 of the datacenters.  
> In the third is one large server with 16 SAS disks serving 8 osds.  
> Eventually we will add one more identical large server into the third 
> datacenter.  I have told ceph to keep 3 copies and tried to do the crushmap 
> in such a way that as long as a majority of mon's can stay up, we could run 
> off of one datacenter's worth of osds.   So in my testing, it doesn't work 
> out quite this way...
> 
> Everything is currently ceph version 0.56.1 
> (e4a541624df62ef353e754391cbbb707f54b16f7)
> 
> I will put hopefully relevant files at the end of this email.
> 
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 
> active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
> 
> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 
> active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 
> 16362/49086 degraded (33.333%)
> 
> At this point everything is still ok.  But when I fail the 2nd datacenter 
> (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 
> incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
> 
> Most VMs quit working; "rbd ls" works, but not a single line from "rados 
> -p rbd ls" comes back and the command hangs.  Now after a while (as you can 
> see from the timestamps) I end up in the following state, and it stays this way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 
> 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 
> remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 
> 7696/49086 degraded (15.679%)
> 
> I'm hoping I've done something wrong, so please advise.  Below are my 
> configs.  If you need something more to help, just ask.
> 
> Normal output with all datacenters up.
> # ceph osd tree
> # id  weight  type name   up/down reweight
> -180  root default
> -336  datacenter hok
> -21   host blade151
> 0 1   osd.0   up  1   
> -41   host blade152
> 1 1   osd.1   up  1   
> -15   1   host blade153
> 2 1   osd.2   up  1   
> -17   1   host blade154
> 3 1   osd.3   up  1   
> -18   1   host blade155
> 4 1   osd.4   up  1   
> -19   1   host blade159
> 5 1   osd.5   up  1   
> -20   1   host blade160
> 6 1   osd.6   up  1   
> -21   1   host blade161
> 7 1   osd.7   up  1   
> -22   1   host blade162
> 8 1   osd.8   up  1   
> -23   1   host blade163
> 9 1   osd.9   up  1   
> -24   36  datacenter csc
> -51   host admbc0-01
> 101   osd.10  up  1   
> -61

RE: OSD's slow down to a crawl

2013-01-09 Thread Matthew Anderson
Hi Sage,

Sorry for the late follow up, I've been on a bit of a testing rampage and 
managed to somewhat sort the problem.

Most of the problem appears to be from the 3.7.1 kernel. It seems to have a 
fairly big issue with its networking stack that was causing Ceph's network 
operations to hang. Moving back to a 3.6.8 kernel fixed this up. I don't know 
the full extent of the problem but it was reported on Phoronix briefly here - 
http://www.phoronix.com/scan.php?page=news_item&px=MTI2Nzc

The second issue was BTRFS on both the 3.7 and 3.6.8 kernels. After running a 
long rados bench (10 minutes) on a fresh cluster it would often slow down 
significantly by going from 250MB/s down to a 50MB/s average. Latency also 
increased dramatically. Restarting the OSD's fixes the issue but after a while 
it slows right down again. In the end I re-formatted the cluster using XFS (and 
also EXT4 for benchmarks) and there wasn't a single issue. I had rados bench 
running for over 30 minutes from another machine and there wasn't a single 
issue. 

At this stage I need to start moving into production with XFS. My test cluster 
arrives in a few weeks, so I should be able to come back to the BTRFS issue 
later on, as it would be very handy to have compression working.

Thanks again for your help
-Matt   


-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: Saturday, 22 December 2012 12:02 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

On Fri, 21 Dec 2012, Matthew Anderson wrote:
> Hi Sage,
> 
> I've tried to reproduce the error again with logging on every OSD and 
> got the above. RADOS bench had stalled on a write request like the 
> last time and the attached log is the grep'd OSD log (# cat osd.25.log 
> | grep client.9501.0:744 > freeze.log) . The OSD that stalled was 25, 
> pg map is below -
> 
> # ceph pg map 6.5d83495b
> osdmap e3775 pg 6.5d83495b (6.95b) -> up [25,31] acting [25,31]
> 
> I hope that's what you were after, if not just let me know

We're getting closer.  The osd tried to send the reply.  Can you reproduce with 
'debug ms = 20' on the osds too, and on the client side do something like

 rados --debug-ms 20 --debug-objecter 20 --log-file /tmp/foo ...

Thanks! 
sage


> 
> Thanks again
> -Matt
> 
> 
> -Original Message-
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: Friday, 21 December 2012 1:14 AM
> To: Matthew Anderson
> Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
> Subject: RE: OSD's slow down to a crawl
> 
> On Thu, 20 Dec 2012, Matthew Anderson wrote:
> > Hi Sage,
> > 
> > Logs are attached. I took the osd logs from osd.24 as this is the 
> > first osd in my SSD pool I've been testing with previously.
> > 
> > The 4MB bench I was able to reproduce the fault by restarting my rbd 
> > export which stalled after a few percent complete. When I ran the 
> > 4MB bench it stalled early on and never received a response back 
> > from the OSD and I terminated it after 60 seconds or so. I wasn't 
> > able to reproduce the fault using the 4kb io size. The 4kb log 
> > should show rados bench completing normally at a respectable speed of about 
> > 1MB/s.
> 
> Let's drill into the hang.. up until that point things look okay.
> 
> 2012-12-21 00:51:26.033622 7f6f3c042760  1 -- 172.16.0.13:0/1023886 
> --> 172.16.0.13:6813/22233 -- osd_op(client.9503.0:185 
> benchmark_data_KVM04_23886_object184 [write 0~4194304] 6.3ca4346e) v4 
> -- ?+0 0x171ea50 con 0x171a7e0
> 
> Do you have a log for that OSD so we can see what happened there?  It 
> may also be that the replicated write is hung.  If you do
> 
>  ceph pg map 6.3ca4346e
> 
> you can see all OSDs storing that PG.  And/or you can grep for
> client.9503.0:185 in 172.16.0.13:6813/22233's log and see whether the sub_op 
> was sent.
> 
> Thanks!
> sage
> 
> 
> > 
> > Thanks
> > -Matt
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org 
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Friday, 21 December 2012 12:30 AM
> > To: Matthew Anderson
> > Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
> > Subject: RE: OSD's slow down to a crawl
> > 
> > Can you do a similar test, but with full logging on?
> > 
> >  ceph tell osd.0 injectargs '--debug-ms 1 --debug-filestore 20 
> > --debug-osd
> > 20 --debug-journal 20'
> >  rados -p ssd bench 30 write -b 4096 -t 1 --log-file /tmp/foo 
> > --debug-ms 1
> > 
> > That will be a single IO in flight at a time and very easy to trace through 
> > the logs.  If you can post the resulting log file (/tmp/foo and from 
> > osd.0), that would be awesome.
> > 
> > Thanks!
> > sage
> > 
> > 
> > 
> > On Thu, 20 Dec 2012, Matthew Anderson wrote:
> > 
> > > # rados bench 60 write -t 256 -p ssd  Maintaining 256 concurrent 
> > > writes of 4194304 bytes for at least 60 seconds.
> > >  Object prefix: benchmark_data_KVM03_12985
> > >sec Cur ops   started  finished  avg MB/s  cur MB/s  l

Re: Is Ceph recovery able to handle massive crash

2013-01-09 Thread Denis Fondras

Hello,

Le 09/01/2013 00:36, Gregory Farnum a écrit :


It looks like it's taking approximately forever for writes to complete
to disk; it's shutting down because threads are going off to write and
not coming back. If you set "osd op thread timeout = 60" (or 120) it
might manage to churn through, but I'd look into why the writes are
taking so long — bad disk, fragmented btrfs filesystem, or something
else.



I believe it is a BTRFS issue, as when I mkfs.btrfs the volume and rejoin 
it to the cluster, it works (the OSD stays up).


Denis