Re: [ceph-users] Rack weight imbalance

2016-02-24 Thread Chen, Xiaoxi
My 0.02: there are two kinds of balance, one for space utilization and another for performance. It seems you will be fine on space utilization, but you might suffer a bit on performance as the disk density increases. The new rack will hold 1/3 of the data on 1/5 of the disks, if we assume the
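A back-of-the-envelope check of the per-disk load implied above, assuming equal-sized disks and CRUSH weights proportional to raw capacity:

    new rack:  (1/3 of data) / (1/5 of disks) = 5/3 of the average load per disk
    old racks: (2/3 of data) / (4/5 of disks) = 5/6 of the average load per disk

So each disk in the new rack would serve roughly twice the IO of a disk in the old racks: fine for capacity, but a likely hot spot for performance.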

Re: [ceph-users] [performance] why rbd_aio_write latency increase from 4ms to 7.3ms after the same test

2015-11-01 Thread Chen, Xiaoxi
Pre-allocate the volume by running "dd" across the entire RBD before you do any performance test :). In this case, you may want to re-create the RBD, pre-allocate, and try again. > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf
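A minimal sketch of that pre-allocation step for a kernel-mapped RBD; the pool/image name and device path below are placeholders, not from the original mail:

    rbd map rbd/testimg                              # maps the image, e.g. to /dev/rbd0
    dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct  # touch every backing object
    sync

After the dd, every object backing the image exists, so a later benchmark measures steady-state write latency rather than first-touch allocation cost.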

Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-14 Thread Chen, Xiaoxi
Hi Mark, the Async result at 128K drops quickly after some point; is that because of the testing methodology? The other conclusion, as I read it, is that SimpleMessenger + jemalloc is the best practice so far, since it has the same performance as Async but uses much less memory?
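For reference, the messenger under test is selected in ceph.conf; a sketch, using the option name from the Hammer/Infernalis era (jemalloc itself is chosen at build time or by LD_PRELOADing libjemalloc into the OSD processes, not in ceph.conf):

    [global]
        ms type = async        # or "simple", the default at the time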

Re: [ceph-users] How to dispatch monitors in a multi-site cluster (ie in 2 datacenters)

2015-04-13 Thread Chen, Xiaoxi
Hi Francois, actually you are discussing two separate questions here :) 1. With the 5 mons (2 in DC1, 2 in DC2, 1 on the WAN), can the monitors form a quorum? How to offload the mon on the WAN? Yes and no. In one case, you lose either of your DCs completely; that's fine, the remaining 3 monitors could
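The quorum arithmetic behind the "yes and no" above (5 monitors, majority required):

    quorum size = floor(5/2) + 1 = 3
    lose DC1 (2 mons)            -> 3 mons left -> quorum holds
    lose DC2 (2 mons)            -> 3 mons left -> quorum holds
    lose one DC plus the WAN mon -> 2 mons left -> no quorum, the cluster blocks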

Re: [ceph-users] question about OSD failure detection

2015-04-13 Thread Chen, Xiaoxi
Hi, 1. In short, the OSD needs to heartbeat with up to #PG x (#Replica - 1) peers, but in practice it will be much less, since most of the peers are redundant. For example, an OSD (say OSD 1) is holding 100 PGs; for some of those PGs, say PG 1, OSD 1 is the primary OSD of PG 1, so OSD 1 needs to
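As a rough upper bound from those numbers, taking replica = 3 purely for illustration (the replica count is not stated in the mail):

    heartbeat peers <= #PG x (#Replica - 1) = 100 x 2 = 200

and the number of distinct OSDs among those peers is normally far smaller, which is why the real heartbeat set is much less.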

[ceph-users] Re: Re: rbd resize (shrink) taking forever and a day

2015-01-07 Thread Chen, Xiaoxi
to be trimmed. I'm not a big fan of a --skip-trimming option as there is the potential to leave some orphan objects that may not be cleaned up correctly. On Tue, Jan 6, 2015 at 8:09 AM, Jake Young jak3...@gmail.com wrote: On Monday, January 5, 2015, Chen, Xiaoxi xiaoxi.c...@intel.com wrote

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-06 Thread Chen, Xiaoxi
do you think? From: Jake Young [mailto:jak3...@gmail.com] Sent: Monday, January 5, 2015 9:45 PM To: Chen, Xiaoxi Cc: Edwin Peer; ceph-users@lists.ceph.com Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day On Sunday, January 4, 2015, Chen, Xiaoxi xiaoxi.c

Re: [ceph-users] Worthwhile setting up Cache tier with small leftover SSD partitions?

2015-01-05 Thread Chen, Xiaoxi
Some low-level caching might help: flashcache, dm-cache, etc… But that may hurt reliability to some extent, and make things harder for the operator ☺ From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Lindsay Mathieson Sent: Monday, January 5, 2015 12:14 PM To: Christian

Re: [ceph-users] redundancy with 2 nodes

2015-01-04 Thread Chen, Xiaoxi
Did you shut down the node with 2 mons? I think it is impossible to have redundancy with only 2 nodes; the Paxos quorum is the reason. Say you have N (N = 2K+1) monitors: you always have one node (let's name it node A) with the majority of the MONs (= K+1) and another node (node B) with the minority number
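Worked out for the smallest case, as an illustration of the argument (the numbers are not from the original mail):

    N = 3 = 2K+1, so K = 1
    node A holds K+1 = 2 mons, node B holds K = 1 mon
    node B fails -> 2 of 3 mons remain -> quorum survives
    node A fails -> 1 of 3 mons remain -> no quorum, the cluster stops

So with only 2 nodes there is always one node whose loss takes the whole cluster down.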

Re: [ceph-users] Ceph data consistency

2014-12-30 Thread Chen, Xiaoxi
Hi, first of all, the data is safe, since it is persisted in the journal; if an error occurs on the OSD data partition, replaying the journal will get the data back. Also, there is a wbthrottle: you can configure how much data (ios, bytes, inodes) you want to remain in memory. A background thread
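The wbthrottle knobs referred to are FileStore options; a ceph.conf sketch with the XFS variants (the values are the rough defaults of that era, shown only for illustration):

    [osd]
        filestore wbthrottle xfs ios start flusher    = 500
        filestore wbthrottle xfs bytes start flusher  = 41943040
        filestore wbthrottle xfs inodes start flusher = 500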

Re: [ceph-users] Re: Re: can not add osd

2014-12-28 Thread Chen, Xiaoxi
Hi Yang bin, not sure if you followed the right docs. I suspect you didn't, because you should use ceph-disk and specify an FS type in the command. I think you might have been misled by the quick start (http://ceph.com/docs/master/start/quick-ceph-deploy/#create-a-cluster), which uses a directory
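A minimal sketch of the ceph-disk path being suggested; the device names are placeholders:

    ceph-disk prepare --fs-type xfs /dev/sdb
    ceph-disk activate /dev/sdb1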

Re: [ceph-users] LevelDB support status is still experimental on Giant?

2014-12-01 Thread Chen, Xiaoxi
We have tested it for a while; basically it seems kind of stable, but it shows terribly bad performance. This is not the fault of Ceph but of LevelDB, or more generally of all K-V storage with an LSM design (RocksDB, etc.): the LSM tree structure naturally introduces very large write amplification, 10X

Re: [ceph-users] LevelDB support status is still experimental on Giant?

2014-12-01 Thread Chen, Xiaoxi
would be better off optimizing the key-value backend code to support a specific kind of load. From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Monday, December 1, 2014 10:14 PM To: Chen, Xiaoxi Cc: Satoru Funai; ceph-us...@ceph.com Subject: Re: [ceph-users] LevelDB support status is still

Re: [ceph-users] LevelDB support status is still experimental on Giant?

2014-12-01 Thread Chen, Xiaoxi
...@gmail.com] Sent: Tuesday, December 2, 2014 1:27 PM To: Chen, Xiaoxi Cc: ceph-us...@ceph.com; Haomai Wang Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant? Hi Xiaoxi, thanks for the very useful information. Can you share more details about the "terribly bad performance"? Is it compared

Re: [ceph-users] prioritizing reads over writes

2014-11-02 Thread Chen, Xiaoxi
Hi Simon, does your workload have a lot of RAW (read-after-write)? Ceph takes an RW lock on each object, so if you write to an RBD and the following read happens to hit the same object, the latency will be higher. Another possibility is the OSD op_wq: it is a priority queue, but reads and writes have the same

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-27 Thread Chen, Xiaoxi
Hi Chris, I am not an expert on LIO, but from your results it seems RBD/Ceph works well (RBD on the local system, no iSCSI) and LIO works well (ramdisk (no RBD) as LIO target), and if you change LIO to use another interface (file, loopback) on top of RBD, it also works well. So

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS

2014-09-18 Thread Chen, Xiaoxi
Hi Mark, it's client IOPS, and we use replica = 2; the journal and OSD are hosted on the same SSDs, so the real IOPS is 23K * 2 * 2 ≈ 90K, still far from the HW limit (30K+ for a single DCS3700). CPU% is ~62% at peak (2 VMs), interrupts distributed. One additional piece of information: it seems the cluster is in a
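The backend arithmetic spelled out: with the journal and data on the same SSD, every replicated write hits the device twice, so

    backend write IOPS = client IOPS x replicas x 2 (journal + data)
                       = 23K x 2 x 2 ≈ 90K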

Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-10 Thread Chen, Xiaoxi
Could you show your cache tiering configuration? Especially these three parameters: ceph osd pool set hot-storage cache_target_dirty_ratio 0.4; ceph osd pool set hot-storage cache_target_full_ratio 0.8; ceph osd pool set {cachepool} target_max_bytes {#bytes} From: ceph-users
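As a concrete instance of the third command, with an illustrative 1 TiB cap (the byte value is not from the original mail):

    ceph osd pool set hot-storage target_max_bytes 1099511627776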

Re: [ceph-users] ceph data consistency

2014-09-09 Thread Chen, Xiaoxi
Yes, but usually a system has several layers of error detection/recovery at different granularities. Disk CRC works at the sector level, Ceph CRC mostly works at the object level, and we also have replication/erasure coding at the system level. The CRC in Ceph mainly handles this case: imagine you have

Re: [ceph-users] Sequential placement

2013-08-26 Thread Chen, Xiaoxi
The randomness may come from Ceph chunking. For RBD, Ceph chunks the image into 4M (default) objects; for rados bench, it is already 4M objects if you didn't set the parameters. So from XFS's view, there are lots of 4M files, by default with ag != 1 (allocation groups, specified during mkfs; the default seems to
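The mkfs knob in question is the XFS allocation group count; a hedged example (the device and mount point are placeholders, and agcount=1 is shown only to illustrate the effect, not as a recommendation):

    mkfs.xfs -f -d agcount=1 /dev/sdX1
    xfs_info /var/lib/ceph/osd/ceph-0      # shows agcount/agsize of an existing OSD filesystem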

Re: [ceph-users] SSD recommendations for OSD journals

2013-07-22 Thread Chen, Xiaoxi
Sent from my iPhone. On 2013-7-23, 0:21, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2013/7/22 Chen, Xiaoxi xiaoxi.c...@intel.com: Imagine you have several writes that have been flushed to the journal and acked, but not yet written to disk. Now the system crashes from a kernel panic or power failure; you

Re: [ceph-users] SSD suggestions as journal

2013-07-22 Thread Chen, Xiaoxi
Basically I think endurance is most important for a Ceph journal, since the journal workload is pure writes; you can easily calculate how long your SSD will last. Even if we assume your SSD only runs at 100MB/s on average, you will burn through 8TB/day and 240TB/month. The DCS 3500 is definitely not
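The back-of-the-envelope behind those numbers:

    100 MB/s x 86,400 s/day ≈ 8.6 TB/day written to the journal
    ≈ 8 TB/day x 30 days    ≈ 240 TB/month

Compare that monthly figure against the drive's rated endurance (TBW) to estimate how long it will last.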

Re: [ceph-users] Any concern about Ceph on CentOS

2013-07-17 Thread Chen, Xiaoxi
PM To: Chen, Xiaoxi Cc: ceph-de...@vger.kernel.org; ceph-us...@ceph.com Subject: Re: Any concern about Ceph on CentOS Hi Xiaoxi, we are really running Ceph on CentOS-6.4 (6 server nodes, 3 client nodes, 160 OSDs). We put a 3.8.13 Kernel on top and installed the ceph-0.61.4 cluster with mkcephfs

Re: [ceph-users] Any concern about Ceph on CentOS

2013-07-16 Thread Chen, Xiaoxi
of the issue is that for the actual cluster itself, it should be OK. I could be wrong here, but I thought the kernel module was only needed for mounting CephFS (and even then, there's a FUSE module that you *can* use anyway). On 07/17/2013 11:18 AM, Chen, Xiaoxi wrote: Hi list, I would

Re: [ceph-users] How many Pipe per Ceph OSD daemon will keep?

2013-06-06 Thread Chen, Xiaoxi
threads. This is still too high for an 8-core or 16-core CPU and will waste a lot of cycles on context switching. Sent from my iPhone. On 2013-6-7, 0:21, Gregory Farnum g...@inktank.com wrote: On Thu, Jun 6, 2013 at 12:25 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote: Hi, From the code, each pipe

Re: [ceph-users] Ceph killed by OS because of OOM under high load

2013-06-04 Thread Chen, Xiaoxi
-Original Message- From: Gregory Farnum [mailto:g...@inktank.com] Sent: June 4, 2013 0:37 To: Chen, Xiaoxi Cc: ceph-de...@vger.kernel.org; Mark Nelson (mark.nel...@inktank.com); ceph-us...@ceph.com Subject: Re: [ceph-users] Ceph killed by OS because of OOM under high load On Mon, Jun 3, 2013 at 8:47 AM

Re: [ceph-users] replacing an OSD or crush map sensitivity

2013-06-03 Thread Chen, Xiaoxi
My 0.02: you really don't need to wait for HEALTH_OK between your recovery steps, just go ahead. Every time a new map is generated and broadcast, the old map and any in-progress recovery will be cancelled. Sent from my iPhone. On 2013-6-2, 11:30, Nigel Williams nigel.d.willi...@gmail.com wrote: Could I have a

Re: [ceph-users] increasing stability

2013-05-29 Thread Chen, Xiaoxi
Cannot agree more. When I try to promote Ceph to internal stakeholders, they always complain about the stability of Ceph, especially when they are evaluating Ceph under high enough pressure; Ceph cannot stay healthy during the test. Sent from my iPhone. On 2013-5-29, 19:13, Wolfgang Hennerbichler

Re: [ceph-users] qemu-1.4.2 rbd-fixed ubuntu packages

2013-05-29 Thread Chen, Xiaoxi
Hi, can I assume I am safe without this patch if I don't use any RBD cache? Sent from my iPhone. On 2013-5-29, 16:00, Alex Bligh a...@alex.org.uk wrote: On 28 May 2013, at 06:50, Wolfgang Hennerbichler wrote: for anybody who's interested, I've packaged the latest qemu-1.4.2 (not 1.5, it didn't
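For reference, the client-side switch in question; a ceph.conf sketch (the RBD cache defaulted to off in that era, as I recall, so setting it explicitly is only for clarity):

    [client]
        rbd cache = false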

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-15 Thread Chen, Xiaoxi
4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0) -Original Message- From: Sage Weil [mailto:s...@inktank.com] Sent: May 15, 2013 11:40 To: Chen, Xiaoxi Cc: Mark Nelson

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-15 Thread Chen, Xiaoxi
Thanks, but I do not quite understand how to determine whether the monitor is overloaded, and if it is, will starting several monitors help? Sent from my iPhone. On 2013-5-15, 23:07, Jim Schutt jasc...@sandia.gov wrote: On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: How responsive generally is the machine under load

[ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-14 Thread Chen, Xiaoxi
Hi, we are suffering from our OSDs flipping between up and down (OSD X gets voted down due to 3 missed pings, and after a while it tells the monitor "map xxx wrongly marked me down"). Because we are running a sequential write performance test on top of RBDs, the cluster network NICs are really in
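Two knobs that commonly come up for this symptom, sketched with illustrative values rather than tuned recommendations; raising the grace period makes transient network congestion less likely to get an OSD marked down:

    [osd]
        osd heartbeat grace    = 30      # default 20 s
        osd heartbeat interval = 6       # default 6 s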

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-14 Thread Chen, Xiaoxi
% iowait). Enabling jumbo frames **seems** to make things worse (just a feeling, no data to support it). Sent from my iPhone. On 2013-5-14, 23:36, Mark Nelson mark.nel...@inktank.com wrote: On 05/14/2013 10:30 AM, Sage Weil wrote: On Tue, 14 May 2013, Chen, Xiaoxi wrote: Hi We are suffering our OSD flipping

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-14 Thread Chen, Xiaoxi
related to the CPU scheduler? The heartbeat thread (in a busy OSD) fails to get enough CPU cycles. -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: May 15, 2013 7:23 To: Chen, Xiaoxi Cc: Mark Nelson; ceph-de

Re: [ceph-users] Journal size

2013-03-26 Thread Chen, Xiaoxi
Are you using a partition as the journal? From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Aleksey Samarin Sent: March 26, 2013 20:45 To: ceph-us...@ceph.com Subject: [ceph-users] Journal size Hello everyone! I have a question about the journal. The Ceph cluster is
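The two setups the question distinguishes, sketched in ceph.conf (the paths and size are placeholders):

    [osd]
        osd journal size = 10240                                             # in MB, used for a file-backed journal
        # file-backed journal (default):  osd journal = /var/lib/ceph/osd/$cluster-$id/journal
        # partition-backed journal:       osd journal = /dev/sdb1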

[ceph-users] Questions about the meaning of osd_client_message_size_cap and etc

2013-03-26 Thread Chen, Xiaoxi
Hi Mark, I think you are the right man for these questions :) I really don't understand how osd_client_message_size_cap, objecter_inflight_op_bytes/ops, and ms_dispatch_throttle_bytes work, and how they affect performance. In particular, objecter_inflight_op_bytes seems to be used
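For context, a sketch of those options with roughly their defaults of that era; the numbers are from memory and should be treated as illustrative, not authoritative:

    osd client message size cap = 524288000     # ~500 MB of in-flight client messages an OSD will accept
    objecter inflight op bytes  = 104857600     # ~100 MB of in-flight data on the client (objecter) side
    objecter inflight ops       = 1024
    ms dispatch throttle bytes  = 104857600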

Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi
Rephrasing it to make it clearer. From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Chen, Xiaoxi Sent: March 25, 2013 17:02 To: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com) Cc: ceph-de...@vger.kernel.org Subject: [ceph-users] Ceph Crash

[ceph-users] Unable to start ceph monitor in V0.59

2013-03-21 Thread Chen, Xiaoxi
Hi list, I cannot start my monitor after moving my cluster to v0.59. Please note that I am not trying to upgrade; I reinstalled the Ceph software stack and reran mkcephfs. I have seen that the monitor changed a lot after 0.58; does mkcephfs still have bugs? Below is the log:

Re: [ceph-users] create volume from an image

2013-03-20 Thread Chen, Xiaoxi
I think Josh may be the right man for this question ☺ To be more precise, I would like to add a few words about the status: 1. We have configured "show_image_direct_url = True" in Glance, and from the cinder-volume log, we can confirm that we have got a direct_url, for example: image_id
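The Glance side of that setup, as it would look in glance-api.conf (note the corrected spelling of the value):

    [DEFAULT]
    show_image_direct_url = True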

Re: [ceph-users] create volume from an image

2013-03-20 Thread Chen, Xiaoxi
Thanks Josh, the problem was solved by updating Ceph on the Glance node. Sent from my iPhone. On 2013-3-20, 14:59, Josh Durgin josh.dur...@inktank.com wrote: On 03/19/2013 11:03 PM, Chen, Xiaoxi wrote: I think Josh may be the right man for this question ☺ To be more precise, I would like to add a few words

Re: [ceph-users] SL4500 as a storage machine

2013-03-17 Thread Chen, Xiaoxi
For me: we have seen a Supermicro machine which is 2U with 2 CPUs and 24 2.5-inch SATA/SAS drives, together with 2 onboard 10Gb NICs. I think it's good enough for both density and computing power. At the other end, we are also planning to evaluate small nodes for Ceph, say an Atom with 2/4 disks per