Re: Re: Consult some problems of Ceph when reading source code
On Thu, 6 Aug 2015, Cai Yi wrote: Dear Dr. Sage: Thank you for your detailed reply. These answers help me a lot. I also have some problems with Question (1). In your reply, the requests are enqueued into the ShardedWQ according to their PG. If I have 3 requests (that is, pg1/r1, pg2/r2, pg3/r3) and I put them into the ShardedWQ, is the processing also serialized? Lots of threads are enqueuing things into the ShardedWQ. A deterministic function of the pg determines which shard the request lands in. https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8247 When I want to dequeue an item from the ShardedWQ, there is a work_queues (the type is a vector of work queues) in the ThreadPool (WorkQueue.cc), and the work queue is then selected from work_queues; so are there many work queues in the request-processing path, or is there no association with the ShardedWQ? https://github.com/ceph/ceph/blob/master/src/common/WorkQueue.cc#L350 Any given thread services a single shard. There can be more than one thread per shard. There's a bunch of code in OSD.cc that ensures that the requests for any given PG are processed in order, serially, so if two threads pull off requests for the same PG one will block so that they still complete in order. When I get an item from the ShardedWQ, I will convert it into a transaction and then read or write. Is the process done one by one (another transaction is handled only when this transaction is over)? If it is, can we still guarantee performance? If it isn't, are the transactions' actions parallel? The write operations are analyzed, prepared, and then started (queued for disk and replicated over the network). Completion is asynchronous (since it can take a while). The read operations are currently done synchronously (we block while we read the data from the local copy on disk), although this is likely to change soon to be either synchronous or async (depending on the backend, hardware, etc.). HTH! sage Thank you a lot! At 2015-08-06 20:44:45, Sage Weil s...@newdream.net wrote: Hi! On Thu, 6 Aug 2015, Cai Yi wrote: Dear developers, My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an Jiaotong University in China. From Ceph's homepage, I know Sage is the author of Ceph, and I got the email address from your GitHub and Ceph's official website. Because Ceph is an excellent distributed file system, I have recently been reading the source code of Ceph (the Hammer edition) to understand the good IO path and the performance of Ceph. However, I face some problems for which I could not find a solution on the Internet or solve by myself and my partners. So I was wondering if you could help us solve some problems. The problems are as follows: 1) In Ceph there is the concept of the transaction. When the OSD receives a write request, it is encapsulated in a transaction. But when the OSD receives many requests, is there a transaction queue to receive the messages? If there is a queue, are these transactions submitted serially or in parallel for the next operation? If it is serial, could the transaction operations influence the performance? The requests are distributed across placement groups and into a sharded work queue, implemented by ShardedWQ in common/WorkQueue.h. This serializes processing for a given PG, but this generally makes little difference as there are typically 100 or more PGs per OSD. 2) From some documents about Ceph, if the OSD receives a read request, the OSD can only read data from the primary and then return it to the client.
Is the description right? Yes. This is usually the right thing to do, or else a given object will end up consuming cache (memory) on more than one OSD and the overall cache efficiency of the cluster will drop by your replication factor. It's only a win to distribute reads when you have a very hot object, or when you want to spend OSD resources to reduce latency (e.g., by sending reads to all replicas and taking the fastest reply). Is there any way to read the data from a replica OSD? Do we have to request the data from the primary OSD when dealing with a read request? If not, and we can read from a replica OSD, can we still guarantee consistency? There is a client-side flag to read from a random or the closest replica, but there are a few bugs that affect consistency when recovery is underway that are being fixed up now. It is likely that this will work correctly in Infernalis, the next stable release. 3) When the OSD receives a message, the message's attribute may be normal dispatch or fast dispatch. What is the difference between normal dispatch and fast dispatch? If the attribute is normal dispatch, it enters the dispatch queue. Is there a single dispatch queue or multiple dispatch queues to deal with all the messages?
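[Editor's note: to illustrate the shard selection Sage describes above, here is a minimal sketch; the names are hypothetical and this is not the actual OSD.cc code. Requests are routed to a shard by a deterministic function of the PG id, so all requests for one PG land in the same shard queue and can be kept in order there.]

    #include <cstdint>
    #include <vector>

    struct Shard { /* per-shard queue and worker threads elided */ };

    // Deterministic PG -> shard mapping: the same PG always maps to the
    // same shard, so its requests are dequeued by that shard's threads
    // in arrival order.
    Shard* pick_shard(std::vector<Shard>& shards, uint64_t pg_id) {
        return &shards[pg_id % shards.size()];
    }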
Newbie question about metadata_list.
Hi, I'm writing a program to replace an image in the cluster with its copy, but I have a problem with metadata_list. I created a pool: #rados mkpool dupa then I created an image: #rbd create --size 1000 -p mypool image --image-format 2 Below is code which tries to get the metadata, but it fails with -EOPNOTSUPP. I compile it with a command like this: g++ file.cpp -lrbd -lrados -I/ceph/source/directory/src

    #include <stdio.h>
    #include <stdlib.h>
    #include <include/rbd/librbd.h>
    #include <include/rbd/librbd.hpp>
    using namespace ceph;
    #include "librbd/ImageCtx.h"

    int main() {
        rados_t clu;
        int ret = rados_create(&clu, NULL);
        if (ret) return -1;
        ret = rados_conf_read_file(clu, NULL);
        if (ret) return -1;
        rados_conf_parse_env(clu, NULL);
        ret = rados_connect(clu);
        if (ret) return -1;
        rados_ioctx_t io;
        ret = rados_ioctx_create(clu, "mypool", &io);
        if (ret) return -1;
        rbd_image_t im;
        ret = rbd_open(io, "image", &im, NULL);
        if (ret) return -1;
        librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
        std::string start;
        int max = 1000;
        bufferlist in, out;
        ::encode(start, in);
        ::encode(max, in);
        ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", "metadata_list", in, out);
        if (ret < 0) printf("fail\n");
        return 0;
    }

So, my question is: what should be set/enabled to get this metadata? Or maybe what am I doing wrong here? Thanks for your help. Best regards, Łukasz
Re: civetweb health check
On 05-08-15 18:37, Srikanth Madugundi wrote: Hi, We are planning to move our radosgw setup from apache to civetweb. We were able to successfully set up and run civetweb on a test cluster. The radosgw instances are fronted by a VIP which currently checks their health by fetching /status.html; after moving to civetweb, the VIP is unable to get the health of the radosgw server using the /status.html endpoint and assumes the server is down. I looked at the ceph radosgw documentation and did not find any configuration to rewrite URLs. What is the best approach for the VIP to get the health of radosgw? You can simply query /. This is what I use in Varnish to do a health check:

    backend rgw {
        .host = "127.0.0.1";
        .port = "7480";
        .connect_timeout = 1s;
        .probe = {
            .timeout = 30s;
            .interval = 3s;
            .window = 10;
            .threshold = 3;
            .request =
                "GET / HTTP/1.1"
                "Host: localhost"
                "User-Agent: Varnish-health-check"
                "Connection: close";
        }
    }

Works fine; RGW will respond with a 200 OK on / Wido Thanks Srikanth
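[Editor's note: for a quick manual check of the same endpoint (host and port taken from the Varnish example above; adjust for your setup):

    curl -i http://127.0.0.1:7480/

A healthy radosgw should answer with HTTP/1.1 200 OK; for an anonymous request the body is typically a short ListAllMyBuckets XML document.]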
Re: FileStore should not use syncfs(2)
On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote: Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. I checked the syncfs code in the 3.10/4.1 kernels. I think both kernels only traverse dirty inodes (inodes in the bdi_writeback::{b_dirty,b_io,b_more_io} lists). What am I missing? That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode). A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)... I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. :( sage
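[Editor's note: for reference, the workaround Sage mentions can be applied at runtime; the default for this knob is 100, and values above that make the kernel reclaim inode/dentry caches more aggressively in favor of page cache. The value 1000 here is just an illustration:

    sysctl -w vm.vfs_cache_pressure=1000
]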
Re: More ondisk_finisher thread?
Sorry for the noise. I have found the cause in our setup and case: we gathered too many logs in our RADOS IO path, and the latency seems to be reasonable (about 0.026 ms) if we don't gather that many logs... 2015-08-05 20:29 GMT+08:00 Sage Weil s...@newdream.net: On Wed, 5 Aug 2015, Ding Dinghua wrote: 2015-08-05 0:13 GMT+08:00 Somnath Roy somnath@sandisk.com: Yes, it has to re-acquire pg_lock today.. But, between the journal write and initiating the ondisk ack, there is one context switch in the code path. So, I guess the pg_lock is not the only thing that is causing this 1 ms delay... Not sure increasing the finisher threads will help in the pg_lock case, as it will be more or less serialized by this pg_lock.. My concern is, if the pg lock of pg A has been grabbed, not only is the ondisk callback of pg A delayed; since ondisk_finisher has only one thread, the ondisk callbacks of other pgs will be delayed too. I wonder if an optimistic approach might help here by making the completion synchronous and doing something like

    if (pg->lock.TryLock()) {
        pg->_finish_thing(completion->op);
        delete completion;
    } else {
        finisher.queue(completion);
    }

or whatever. We'd need to ensure that we aren't holding any lock or throttle budget that the pg could deadlock against. sage -- Ding Dinghua
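[Editor's note: a generic sketch of the optimistic-completion pattern Sage outlines, using std::mutex; all names here are hypothetical, and, as he notes, the real code would also have to prove that no lock or throttle budget is held that the PG could deadlock against.]

    #include <mutex>
    #include <queue>

    struct Completion { int op; };

    struct PG {
        std::mutex lock;
        void finish(Completion* c) { /* apply the on-disk ack for c->op */ }
    };

    struct Finisher {
        // Drained by a dedicated finisher thread; locking elided in this sketch.
        std::queue<Completion*> q;
        void queue(Completion* c) { q.push(c); }
    };

    // Run the completion inline when the PG lock is uncontended; otherwise
    // fall back to the (single-threaded) finisher as before.
    void on_journal_ack(PG& pg, Completion* c, Finisher& fin) {
        if (pg.lock.try_lock()) {
            pg.finish(c);
            pg.lock.unlock();
            delete c;
        } else {
            fin.queue(c);
        }
    }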
Re: wip-user status
On Wed, 5 Aug 2015, Milan Broz wrote: On 08/04/2015 10:53 PM, Sage Weil wrote: I rebased the wip-user patches from wip-selinux-policy onto wip-selinux-policy-no-user + merge to master so that it sits on top of the newly-merged systemd changes. Great, so if it is in a build-ready state, I can try it with our virtual cluster install. Notes/issues: - ceph-osd-prestart.sh verifies that the osd_data dir is owned by either 'root' or 'ceph' or else it exits with an error. (Presumably systemd will fail to start the unit in this case.) It prints a helpful message pointing the user at 'ceph-disk chown ...'. - 'ceph-disk chown ...' is not implemented yet. Should it take the base device, like activate and prepare? Or a mounted path? Or either? It should be easy to convert between device and mountpoint by using findmnt, so I would prefer whatever is more consistent with the user interface... IIRC, if the parameter is a base device, what should happen if the device is not mounted? If a mount path, then what about other data/journal partitions? It seems to me that the parameter could be the base OSD device and chown will simply handle all its partitions. (So for an encrypted OSD it needs to get the key to unlock it etc...) This sounds like the cleanest approach to me too. - Currently ceph-osd@.service unconditionally passes --setuser ceph to ceph-osd... even if the data directory is owned by root. I don't think systemd is smart enough to do this conditionally unless we make an ugly wrapper script that starts ceph-osd. Alternatively, we could make ceph-osd conditionally do the setuid based on the ownership of the directory, but... meh. The idea was to do the setuid *very* early in the startup process so that logging and so on are opened as the ceph user. Ideas? Well, systemd could do that if the service is generated (like e.g. cryptsetup activation jobs are generated according to crypttab). But this adds complexity that we do not need... Maybe another option is to use an environment variable (CEPH_USER or so), set it in the service Environment=/EnvironmentFile... and ceph-osd will use that... But I think some systemd gurus will find something better here :) Take a look at https://github.com/ceph/ceph/pull/5494 The idea is to just make the setuid in the daemon conditional on a path in the file system matching the uid/gid. If they match, we drop privs. If they don't, we print a warning and remain root. This doesn't handle the case where the daemon data dir is owned by something other than ceph or root. It will work just fine (the daemon will run as root), but perhaps we want to fail in that case? The OSD has an explicit check for this in ceph-osd-prestart.sh asking the admin to ceph-disk chown, but the other daemons don't have prestarts. They also generally won't have mismatched ownership because they generally won't get swapped around between hosts... sage
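[Editor's note: a minimal sketch of the conditional privilege drop described around that PR; this is a hypothetical helper, not the actual daemon code, and it assumes a user named "ceph" and a single data dir. Drop to the ceph user only if the data dir is already owned by it; otherwise warn and keep running as root.]

    #include <pwd.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Returns 1 if privileges were dropped, 0 if we stayed root, -1 on error.
    static int maybe_drop_privs(const char *data_dir) {
        struct stat st;
        struct passwd *pw = getpwnam("ceph");
        if (!pw || stat(data_dir, &st) < 0)
            return -1;
        if (st.st_uid != pw->pw_uid || st.st_gid != pw->pw_gid) {
            fprintf(stderr, "warning: %s not owned by ceph; staying root\n",
                    data_dir);
            return 0;
        }
        // Group first, then user; done early so logs are opened as ceph.
        if (setgid(pw->pw_gid) < 0 || setuid(pw->pw_uid) < 0)
            return -1;
        return 1;
    }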
Re: civetweb health check
Hitting the '/' endpoint worked. Thanks Srikanth On Thu, Aug 6, 2015 at 1:26 AM, Wido den Hollander w...@42on.com wrote: On 05-08-15 18:37, Srikanth Madugundi wrote: Hi, We are planning to move our radosgw setup from apache to civetweb. We were able to successfully set up and run civetweb on a test cluster. The radosgw instances are fronted by a VIP which currently checks their health by fetching /status.html; after moving to civetweb, the VIP is unable to get the health of the radosgw server using the /status.html endpoint and assumes the server is down. I looked at the ceph radosgw documentation and did not find any configuration to rewrite URLs. What is the best approach for the VIP to get the health of radosgw? You can simply query /. This is what I use in Varnish to do a health check:

    backend rgw {
        .host = "127.0.0.1";
        .port = "7480";
        .connect_timeout = 1s;
        .probe = {
            .timeout = 30s;
            .interval = 3s;
            .window = 10;
            .threshold = 3;
            .request =
                "GET / HTTP/1.1"
                "Host: localhost"
                "User-Agent: Varnish-health-check"
                "Connection: close";
        }
    }

Works fine; RGW will respond with a 200 OK on / Wido Thanks Srikanth
RE: Erasure Code Plugins : PLUGINS_V3 feature
Hi Loic, Thank you for arranging the PLUGINS_V3 feature. I have just started to review pull request #5493. Please wait just a moment. By the way, may I ask what the current status of #5257 (decoding cache: the last immediate request from SHEC) is? https://github.com/ceph/ceph/pull/5257 Tell us if we need to rebase again. Best regards, Takeshi Miyamae -Original Message- From: Loic Dachary [mailto:l...@dachary.org] Sent: Thursday, August 6, 2015 10:58 PM To: Miyamae, Takeshi/宮前 剛 Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔 Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature Hi Takeshi, https://github.com/ceph/ceph/pull/5493 is ready for your review. The matching integration tests can be found at https://github.com/ceph/ceph-qa-suite/pull/523 Cheers On 06/08/2015 02:28, Miyamae, Takeshi wrote: Dear Sage, note that what this really means is that the on-disk encoding needs to remain fixed. Thank you for letting us know this important notice. We have no plan to change shec's format at this moment, but we will remember the comment for any future events. Best Regards, Takeshi Miyamae -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, August 6, 2015 3:45 AM To: Loic Dachary; Miyamae, Takeshi/宮前 剛 Cc: Samuel Just; Ceph Development Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature On Wed, 5 Aug 2015, Loic Dachary wrote: Hi Sam, How does this proposal sound ? It would be great if that was done before the feature freeze. I think it's a good time. Takeshi, note that what this really means is that the on-disk encoding needs to remain fixed. If we decide to change it down the line, we'll have to make a 'shec2' or similar so that the old format is still decodable (or ensure that existing data can still be read in some other way). Sound good? sage Cheers On 29/07/2015 11:16, Loic Dachary wrote: Hi Sam, The SHEC plugin[0] has been running in the rados runs[1] in the past few months. It also has a matching corpus verification which runs on every make check[2], as well as its optimized variants. I believe the experimental flag can now be removed. In order to do so, we need to use a PLUGINS_V3 feature, in the same way we did back in Giant when the ISA and LRC plugins were introduced[3]. This won't be necessary in the future, when there is a generic plugin mechanism, but right now that's what we need. It would be a commit very similar to the one implementing PLUGINS_V2[4]. Is this agreeable to you ? Or would you rather see another way to resolve this ? Cheers [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec [1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec [2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52 [3] http://tracker.ceph.com/issues/9343 [4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489 -- Loïc Dachary, Artisan Logiciel Libre -- Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] Is it safe to increase pg number in a production environment
Hi Jan, Thank you very much for the suggestion. Regards, Jevon On 5/8/15 19:36, Jan Schermer wrote: Hi, comments inline. On 05 Aug 2015, at 05:45, Jevon Qiao qiaojianf...@unitedstack.com wrote: Hi Jan, Thank you for the detailed suggestion. Please see my reply in-line. On 5/8/15 01:23, Jan Schermer wrote: I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize the impact on production. Basically we had to 1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs 2) increase pgp_num in small increments and then go higher So you completely finished step 1 before jumping into step 2. Have you ever tried mixing them together? Increase pg_num, increase pgp_num, increase pg_num... Actually we first increased both to 8192 and then decided to go higher, but that doesn't matter. The only reason for this was that the first step could run unattended at night without disturbing the workload.* The second step had to be attended. * in other words, we didn't see "slow requests" because of our threshold settings, but while PGs were being created the cluster paused IO for non-trivial amounts of time. I suggest you do this in as small steps as possible, depending on your SLAs. We went from 4096 placement groups up to 16384. pg_num (the number of on-disk created placement groups) was increased like this:

    # for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done

This ran overnight (and the step was upped to 128 during the night). Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact, and second because of wildly differing free space on the OSDs. We did it again in steps and waited for the cluster to settle before continuing. Each step upped pgp_num by about 2%, and as we got higher (8192) we increased this to much more - the last step was 15360-16384, with the same impact the initial 4096-4160 step had. The strategy you adopted looks great. I'll do some experiments on a test cluster to evaluate the real impact of each step. The end result is much better but still nowhere near optimal - a bigger impact would be upgrading to a newer Ceph release and setting the new tunables, because we're running Dumpling. Be aware that PGs cost some space (a rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That's a lot of memory and space with higher OSD counts... This is a good point. So along with the increment of PGs, we also need to take the current status of the cluster (the available disk space and memory for each OSD) into account and evaluate whether we need to add more resources. Depends on how much free space you have. We had some OSDs at close to 85% capacity before we started (and other OSDs at only 30%). When increasing the number of PGs the data shuffled greatly - but this depends on what CRUSH rules you have (and what version you are running). Newer versions with newer tunables will make this a lot easier, I guess. And while I haven't calculated the number of _objects_ per PG, we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300) and this seems to be the cause of the poor data balancing. In our environment, we also encountered the imbalanced mapping between PGs and OSDs. What kind of bucket algorithm was used in your environment? Any idea on how to minimize it?
We are using straw because of dumpling. Straw2 should make everything better :-) Jan Thanks, Jevon Jan On 04 Aug 2015, at 18:52, Marek Dohojda mdoho...@altitudedigital.com wrote: I have done this not that long ago. My original PG estimates were wrong and I had to increase them. After increasing the PG numbers, Ceph rebalanced, and that took a while. To be honest, in my case the slowdown wasn't really visible, but it took a while. My strong suggestion to you would be to do it during a period of low IO, and be prepared that this will take quite a long time to accomplish. Do it slowly and do not increase multiple pools at once. It isn't recommended practice, but it is doable. On Aug 4, 2015, at 10:46 AM, Samuel Just sj...@redhat.com wrote: It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster. -Sam On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote: Hi Cephers, This is a greeting from Jevon. Currently, I'm experiencing an issue which troubles me a lot, so I'm writing to ask for your comments/help/suggestions. More details are provided below. Issue: I set up a cluster having 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated per the equation 'OSDs * 100'/pool size. The cluster
Re: About the Ceph erasure pool with ISA plugin on Intel xeon CPU
Hello Loic, the following are my steps and configurations: (1) The 11 OSDs and 3 monitors ran in docker containers on the same host machine. (2) Each OSD had one 1T HDD. (3) I set the erasure coding pool profiles:

    ## Jerasure, Reed-Solomon
    $ ceph osd erasure-code-profile set reed_k4m2_A k=4 m=2 directory=/usr/lib64/ceph/erasure-code
    ## ISA, Reed-Solomon
    $ ceph osd erasure-code-profile set reed_k4m2_isa_A k=4 m=2 directory=/usr/lib64/ceph/erasure-code plugin=isa technique=reed_sol_van

(4) Then, the erasure pools were created:

    ## Jerasure, Reed-Solomon
    $ ceph osd pool create reed_k4m2_A_pool 128 128 erasure reed_k4m2_A
    ## ISA, Reed-Solomon
    $ ceph osd pool create reed_k4m2_isa_A_pool 128 128 erasure reed_k4m2_isa_A

(5) Then, I used the rados benchmark to test the write performance:

    ## Jerasure, Reed-Solomon
    $ rados bench -p reed_k4m2_A_pool 500 write --no-cleanup
    ## ISA, Reed-Solomon
    $ rados bench -p reed_k4m2_isa_A_pool 500 write --no-cleanup

The results: (1) Jerasure/Reed-Solomon: write throughput 136.0 MB/s, latency 0.471 s (2) ISA/Reed-Solomon: write throughput 133.1 MB/s, latency 0.481 s (3) Jerasure/cauchy_good: write throughput 138.3 MB/s, latency 0.462 s (4) ISA/cauchy: write throughput 140.2 MB/s, latency 0.452 s -- My CPU information: Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz $ cat /proc/cpuinfo | grep flags flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt RAM: 12 GiB The performance results show essentially no difference... Thanks, :) Derek 2015-08-06 20:31 GMT+08:00 Loic Dachary l...@dachary.org: Hi, Could you please publish the benchmark results somewhere? I should be able to figure out why you don't see a difference. Cheers On 06/08/2015 13:25, Derek Su wrote: Dear Mr. Dachary and all, Recently I found your blog posts showing performance tests of erasure pools (http://dachary.org/?p=3042 , http://dachary.org/?p=3665). The results indicate that the write throughput can be enhanced significantly using an Intel Xeon CPU. I tried to create an erasure pool with the isa plugin, reed_sol_van technique, and k/m=4/2 on Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz machines. However, the results of the rados benchmark showed no difference between the jerasure and isa plugins. It seems very strange. Do I need any other configuration besides setting the erasure profile? In addition, how can I know whether the erasure pool is actually accelerated by the ISA plugin? Is there any command I can use? Thanks, :) Derek Su. -- Loïc Dachary, Artisan Logiciel Libre
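[Editor's note: one way to confirm which plugin a profile uses is the standard CLI, e.g.

    $ ceph osd erasure-code-profile get reed_k4m2_isa_A

which should list plugin=isa for the ISA profile (this shows the configuration only, not runtime acceleration). As for the identical numbers: one plausible explanation, given 11 HDD-backed OSDs on a single host, is that the disks rather than the encoding are the bottleneck, in which case rados bench would not distinguish the plugins; a CPU-bound comparison such as the ceph_erasure_code_benchmark tool used in the blog posts referenced above would.]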
RE: OSD sometimes stuck in init phase
Thanks for the quick response, Haomai! Please find the backtrace here [1]. [1] - http://paste.openstack.org/show/411139/ Regards, Unmesh G. IRC: unmeshg -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 06, 2015 5:31 PM To: Gurjar, Unmesh Cc: ceph-devel@vger.kernel.org Subject: Re: OSD sometimes stuck in init phase Could you print all your thread backtraces via 'thread apply all bt'? On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Hi, On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). It is observed that a few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:

    2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
    2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
    2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object

The log statement is inaccurate though, since it is actually doing the init operation for the 'infos' object (as can be observed from the source [2]). Upon debugging further, the thread seems to be waiting to acquire the 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

    (gdb) where
    #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
    #2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
    #3  0x7fd313076790 in OSD::init() ()
    #4  0x7fd3130233a7 in main ()

In a few cases, upon restarting the stuck OSD (service), it successfully completes the 'init' phase and reaches the 'up' and 'in' state! Any help is greatly appreciated. Please let me know if any more details are required for root causing. [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) [2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211 Regards, Unmesh G. IRC: unmeshg -- Best Regards, Wheat
Re: OSD sometimes stuck in init phase
Could you print all your thread backtraces via 'thread apply all bt'? On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Hi, On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). It is observed that a few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:

    2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
    2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
    2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object

The log statement is inaccurate though, since it is actually doing the init operation for the 'infos' object (as can be observed from the source [2]). Upon debugging further, the thread seems to be waiting to acquire the 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

    (gdb) where
    #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
    #2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
    #3  0x7fd313076790 in OSD::init() ()
    #4  0x7fd3130233a7 in main ()

In a few cases, upon restarting the stuck OSD (service), it successfully completes the 'init' phase and reaches the 'up' and 'in' state! Any help is greatly appreciated. Please let me know if any more details are required for root causing. [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) [2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211 Regards, Unmesh G. IRC: unmeshg -- Best Regards, Wheat
RE: OSD sometimes stuck in init phase
Please find ceph.conf at [1] and the corresponding OSD log at [2]. To clarify one thing I skipped earlier: while bringing up the OSDs, 'ceph-disk activate' was getting hung (due to issue [3]). To get over this, I had to temporarily disable 'journal dio' to get the disk activated (with 'mark-init' set to none) and then explicitly start the OSD service after updating the conf to enable 'journal dio'. I am hopeful that this does not cause the present issue (since a few OSDs start successfully on the first attempt and others on subsequent service restarts)! [1] - http://paste.openstack.org/show/411161/ [2] - http://paste.openstack.org/show/411162/ [3] - http://tracker.ceph.com/issues/9768 Regards, Unmesh G. IRC: unmeshg -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 06, 2015 6:22 PM To: Gurjar, Unmesh Cc: ceph-devel@vger.kernel.org Subject: Re: OSD sometimes stuck in init phase I don't see anything strange. Could you paste your ceph.conf? And restart this osd with debug_osd=20/20, debug_filestore=20/20 :-) On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Thanks for the quick response, Haomai! Please find the backtrace here [1]. [1] - http://paste.openstack.org/show/411139/ Regards, Unmesh G. IRC: unmeshg -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 06, 2015 5:31 PM To: Gurjar, Unmesh Cc: ceph-devel@vger.kernel.org Subject: Re: OSD sometimes stuck in init phase Could you print all your thread backtraces via 'thread apply all bt'? On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Hi, On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). It is observed that a few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:

    2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
    2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
    2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object

The log statement is inaccurate though, since it is actually doing the init operation for the 'infos' object (as can be observed from the source [2]). Upon debugging further, the thread seems to be waiting to acquire the 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

    (gdb) where
    #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
    #2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
    #3  0x7fd313076790 in OSD::init() ()
    #4  0x7fd3130233a7 in main ()

In a few cases, upon restarting the stuck OSD (service), it successfully completes the 'init' phase and reaches the 'up' and 'in' state! Any help is greatly appreciated.
Please let me know if any more details are required for root causing. [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) [2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211 Regards, Unmesh G. IRC: unmeshg -- Best Regards, Wheat -- Best Regards, Wheat
Re: FileStore should not use syncfs(2)
On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote: Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. I'm pretty sure Dave had some patches for that. Even if they aren't included, it's not an unsolved problem. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. That additional fsync in XFS is basically free, so better get it right and let the file system micro-optimize for you.
Re: FileStore should not use syncfs(2)
On Thu, 6 Aug 2015, Haomai Wang wrote: Agree On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote: Thanks Sage for digging down.. I was suspecting something similar.. As I mentioned in today's call, even at idle, syncfs is taking ~60 ms. I have 64 GB of RAM in the system. The workaround I was talking about today is working pretty well so far. In this implementation, I am not giving much work to syncfs, as each worker thread is writing in O_DSYNC mode. I am issuing syncfs before trimming the journal, and most of the time I saw it taking ~100 ms. Actually I would prefer we don't use syncfs anymore. I would rather use aio+dio plus a FileStore custom cache to deal with all the syncfs/pagecache things. That way we can even make the cache smarter and aware of the upper levels, instead of relying on fadvise* calls. Second, we can use a checkpoint method like MySQL InnoDB: we can know the bandwidth of the frontend (FileJournal) and decide how much and how often we want to flush (using aio+dio). Anyway, because it's a big project, we may prefer to do this work in newstore instead of filestore. I have to wake up the sync_thread now after each worker thread finishes writing. I will benchmark both approaches. As we discussed earlier, in the fsync-only approach, we still need to do a db sync to make sure the leveldb stuff is persisted, right? Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, August 05, 2015 2:27 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org; sj...@redhat.com Subject: FileStore should not use syncfs(2) Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode). A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)... I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. I guess there are only a few directory-modifying operations, is that true? Maybe we only need to do syncfs when modifying directories? I'd say there are a few broad cases:
- creating or deleting objects. Simply fsyncing the file is sufficient on XFS; we should confirm what the behavior is on other distros. But even if we do the fsync on the dir, this is simple to implement.
- renaming objects (collection_move_rename). Easy to add an fsync here.
- HashIndex rehashing. This is where I get nervous... and setting some flag that triggers a full syncfs might be an interim solution, since it's a pretty rare event.
OTOH, adding the fsync calls in the HashIndex code probably isn't so bad to audit and get right either... sage
Re: FileStore should not use syncfs(2)
On Thu, 6 Aug 2015, Christoph Hellwig wrote: On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote: Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. I'm pretty sure Dave had some patches for that. Even if they aren't included, it's not an unsolved problem. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. That additional fsync in XFS is basically free, so better get it right and let the file system micro-optimize for you. I'm guessing the strategy here should be to fsync the file (leaf) and then any affected ancestors, such that the directory fsyncs are effectively no-ops? Or does it matter? Thanks! sage
OSD sometimes stuck in init phase
Hi, On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). It is observed that a few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:

    2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
    2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
    2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object

The log statement is inaccurate though, since it is actually doing the init operation for the 'infos' object (as can be observed from the source [2]). Upon debugging further, the thread seems to be waiting to acquire the 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

    (gdb) where
    #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
    #2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
    #3  0x7fd313076790 in OSD::init() ()
    #4  0x7fd3130233a7 in main ()

In a few cases, upon restarting the stuck OSD (service), it successfully completes the 'init' phase and reaches the 'up' and 'in' state! Any help is greatly appreciated. Please let me know if any more details are required for root causing. [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) [2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211 Regards, Unmesh G. IRC: unmeshg
Re: Consult some problems of Ceph when reading source code
Hi! On Thu, 6 Aug 2015, Cai Yi wrote: Dear developers, My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an Jiaotong University in China. From Ceph's homepage, I know Sage is the author of Ceph, and I got the email address from your GitHub and Ceph's official website. Because Ceph is an excellent distributed file system, I have recently been reading the source code of Ceph (the Hammer edition) to understand the good IO path and the performance of Ceph. However, I face some problems for which I could not find a solution on the Internet or solve by myself and my partners. So I was wondering if you could help us solve some problems. The problems are as follows: 1) In Ceph there is the concept of the transaction. When the OSD receives a write request, it is encapsulated in a transaction. But when the OSD receives many requests, is there a transaction queue to receive the messages? If there is a queue, are these transactions submitted serially or in parallel for the next operation? If it is serial, could the transaction operations influence the performance? The requests are distributed across placement groups and into a sharded work queue, implemented by ShardedWQ in common/WorkQueue.h. This serializes processing for a given PG, but this generally makes little difference as there are typically 100 or more PGs per OSD. 2) From some documents about Ceph, if the OSD receives a read request, the OSD can only read data from the primary and then return it to the client. Is the description right? Yes. This is usually the right thing to do, or else a given object will end up consuming cache (memory) on more than one OSD and the overall cache efficiency of the cluster will drop by your replication factor. It's only a win to distribute reads when you have a very hot object, or when you want to spend OSD resources to reduce latency (e.g., by sending reads to all replicas and taking the fastest reply). Is there any way to read the data from a replica OSD? Do we have to request the data from the primary OSD when dealing with a read request? If not, and we can read from a replica OSD, can we still guarantee consistency? There is a client-side flag to read from a random or the closest replica, but there are a few bugs that affect consistency when recovery is underway that are being fixed up now. It is likely that this will work correctly in Infernalis, the next stable release. 3) When the OSD receives a message, the message's attribute may be normal dispatch or fast dispatch. What is the difference between normal dispatch and fast dispatch? If the attribute is normal dispatch, it enters the dispatch queue. Is there a single dispatch queue or multiple dispatch queues to deal with all the messages? There is a single thread that does the normal dispatch. Fast dispatch processes the message synchronously from the thread that received the message, so it is faster, but it has to be careful not to block. These are the problems I am facing. Thank you for your patience and cooperation, and I look forward to hearing from you. Hope that helps! sage
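[Editor's note: for reference, the client-side flags Sage mentions are exposed in librados. Below is a sketch, not authoritative usage guidance: the flag names come from librados.hpp (OPERATION_BALANCE_READS picks a random replica, OPERATION_LOCALIZE_READS prefers the closest one), and reads issued this way are subject to the consistency caveats described above.]

    #include <rados/librados.hpp>
    #include <string>

    // Read `len` bytes of `oid`, allowing the read to be serviced by a
    // non-primary replica.
    int balanced_read(librados::IoCtx& ioctx, const std::string& oid,
                      librados::bufferlist* out, size_t len) {
        librados::ObjectReadOperation op;
        int rval = 0;
        op.read(0, len, out, &rval);
        librados::AioCompletion* c = librados::Rados::aio_create_completion();
        int r = ioctx.aio_operate(oid, c, &op,
                                  librados::OPERATION_BALANCE_READS, out);
        if (r == 0) {
            c->wait_for_complete();       // block until the read finishes
            r = c->get_return_value();
        }
        c->release();
        return r;
    }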
Re: FileStore should not use syncfs(2)
On Thu, 6 Aug 2015, Yan, Zheng wrote: On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote: Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. I checked the syncfs code in the 3.10/4.1 kernels. I think both kernels only traverse dirty inodes (inodes in the bdi_writeback::{b_dirty,b_io,b_more_io} lists). What am I missing? See wait_sb_inodes in fs/fs-writeback.c, called by sync_inodes_sb. sage
About the Ceph erasure pool with ISA plugin on Intel xeon CPU
Dear Mr. Dachary and all, Recently I found your blog posts showing performance tests of erasure pools (http://dachary.org/?p=3042 , http://dachary.org/?p=3665). The results indicate that the write throughput can be enhanced significantly using an Intel Xeon CPU. I tried to create an erasure pool with the isa plugin, reed_sol_van technique, and k/m=4/2 on Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz machines. However, the results of the rados benchmark showed no difference between the jerasure and isa plugins. It seems very strange. Do I need any other configuration besides setting the erasure profile? In addition, how can I know whether the erasure pool is actually accelerated by the ISA plugin? Is there any command I can use? Thanks, :) Derek Su.
Re: Newbie question about metadata_list.
On Thu, Aug 6, 2015 at 12:26 PM, Łukasz Szymczyk lukasz.szymc...@corp.ovh.com wrote: Hi, I'm writing a program to replace an image in the cluster with its copy, but I have a problem with metadata_list. I created a pool: #rados mkpool dupa then I created an image: #rbd create --size 1000 -p mypool image --image-format 2 Below is code which tries to get the metadata, but it fails with -EOPNOTSUPP. I compile it with a command like this: g++ file.cpp -lrbd -lrados -I/ceph/source/directory/src

    #include <stdio.h>
    #include <stdlib.h>
    #include <include/rbd/librbd.h>
    #include <include/rbd/librbd.hpp>
    using namespace ceph;
    #include "librbd/ImageCtx.h"

    int main() {
        rados_t clu;
        int ret = rados_create(&clu, NULL);
        if (ret) return -1;
        ret = rados_conf_read_file(clu, NULL);
        if (ret) return -1;
        rados_conf_parse_env(clu, NULL);
        ret = rados_connect(clu);
        if (ret) return -1;
        rados_ioctx_t io;
        ret = rados_ioctx_create(clu, "mypool", &io);
        if (ret) return -1;
        rbd_image_t im;
        ret = rbd_open(io, "image", &im, NULL);
        if (ret) return -1;
        librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
        std::string start;
        int max = 1000;
        bufferlist in, out;
        ::encode(start, in);
        ::encode(max, in);
        ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", "metadata_list", in, out);
        if (ret < 0) printf("fail\n");

You should use the rbd_metadata_list() C API instead of this.

        return 0;
    }

So, my question is: what should be set/enabled to get this metadata? Or maybe what am I doing wrong here? Try 'rbd image-meta list image'? It's a fairly recent feature; are you sure your OSDs support it? What's the output of 'ceph daemon osd.0 version'? Thanks, Ilya
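[Editor's note: a sketch of the API Ilya points at, assuming the rbd_metadata_list() signature from recent librbd.h and an already-opened rbd_image_t; the buffer sizes are arbitrary for illustration, and the keys/values are assumed to come back as NUL-separated strings packed into the buffers.]

    #include <rbd/librbd.h>
    #include <stdio.h>
    #include <string.h>

    int list_metadata(rbd_image_t image) {
        char keys[4096], vals[4096];
        size_t keys_len = sizeof(keys), vals_len = sizeof(vals);
        // start = "" and max = 1000 mirror the original program's arguments.
        int r = rbd_metadata_list(image, "", 1000, keys, &keys_len, vals, &vals_len);
        if (r < 0)
            return r;  // e.g. -ERANGE if the buffers are too small
        for (size_t k = 0, v = 0; k < keys_len && v < vals_len;) {
            printf("%s = %s\n", keys + k, vals + v);
            k += strlen(keys + k) + 1;
            v += strlen(vals + v) + 1;
        }
        return 0;
    }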
Re: OSD sometimes stuck in init phase
I don't see anything strange. Could you paste your ceph.conf? And restart this osd with debug_osd=20/20, debug_filestore=20/20 :-) On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Thanks for the quick response, Haomai! Please find the backtrace here [1]. [1] - http://paste.openstack.org/show/411139/ Regards, Unmesh G. IRC: unmeshg -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 06, 2015 5:31 PM To: Gurjar, Unmesh Cc: ceph-devel@vger.kernel.org Subject: Re: OSD sometimes stuck in init phase Could you print all your thread backtraces via 'thread apply all bt'? On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Hi, On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). It is observed that a few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:

    2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
    2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
    2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object

The log statement is inaccurate though, since it is actually doing the init operation for the 'infos' object (as can be observed from the source [2]). Upon debugging further, the thread seems to be waiting to acquire the 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

    (gdb) where
    #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
    #2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
    #3  0x7fd313076790 in OSD::init() ()
    #4  0x7fd3130233a7 in main ()

In a few cases, upon restarting the stuck OSD (service), it successfully completes the 'init' phase and reaches the 'up' and 'in' state! Any help is greatly appreciated. Please let me know if any more details are required for root causing. [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) [2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211 Regards, Unmesh G. IRC: unmeshg -- Best Regards, Wheat -- Best Regards, Wheat
Re: FileStore should not use syncfs(2)
On Thu, Aug 06, 2015 at 06:00:42AM -0700, Sage Weil wrote: I'm guessing the strategy here should be to fsync the file (leaf) and then any affected ancestors, such that the directory fsyncs are effectively no-ops? Or does it matter? All metadata transactions log the involved parties (parent and child inode(s), mostly) in the same transaction. So flushing one of them out is enough. But file data I/O might dirty the inode before flushing them out, so to avoid writing out the inode log item twice you first want to fsync any file that had data I/O, followed by directories or special files that only had metadata modified.
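[Editor's note: a minimal sketch of the ordering Christoph describes, using plain POSIX calls; the paths and helper name are hypothetical. Fsync the data file first (its I/O re-dirties the inode), then the directory that was modified.]

    #include <fcntl.h>
    #include <unistd.h>

    // Persist a freshly written object file and its parent directory.
    int persist_object(const char *file_path, const char *dir_path) {
        int fd = open(file_path, O_WRONLY);
        if (fd < 0) return -1;
        int r = fsync(fd);           // flush the file's data and inode
        close(fd);
        if (r < 0) return -1;

        int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        r = fsync(dfd);              // flush the directory entry (cheap on XFS)
        close(dfd);
        return r;
    }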
Re: OSD sometimes stuck in init phase
It seems the filestore isn't completing the transaction as expected. Sorry, you need to add debug_journal=20/20 to help find the reason. :-) BTW, what's your OS version? How many OSDs do you have in this cluster, and how many OSDs failed to start like this? On Thu, Aug 6, 2015 at 9:17 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Please find ceph.conf at [1] and the corresponding OSD log at [2]. To clarify one thing I skipped earlier: while bringing up the OSDs, 'ceph-disk activate' was getting hung (due to issue [3]). To get over this, I had to temporarily disable 'journal dio' to get the disk activated (with 'mark-init' set to none) and then explicitly start the OSD service after updating the conf to enable 'journal dio'. I am hopeful that this does not cause the present issue (since a few OSDs start successfully on the first attempt and others on subsequent service restarts)! [1] - http://paste.openstack.org/show/411161/ [2] - http://paste.openstack.org/show/411162/ [3] - http://tracker.ceph.com/issues/9768 Regards, Unmesh G. IRC: unmeshg -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 06, 2015 6:22 PM To: Gurjar, Unmesh Cc: ceph-devel@vger.kernel.org Subject: Re: OSD sometimes stuck in init phase I don't see anything strange. Could you paste your ceph.conf? And restart this osd with debug_osd=20/20, debug_filestore=20/20 :-) On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Thanks for the quick response, Haomai! Please find the backtrace here [1]. [1] - http://paste.openstack.org/show/411139/ Regards, Unmesh G. IRC: unmeshg -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 06, 2015 5:31 PM To: Gurjar, Unmesh Cc: ceph-devel@vger.kernel.org Subject: Re: OSD sometimes stuck in init phase Could you print all your thread backtraces via 'thread apply all bt'? On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote: Hi, On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate data and journal disks (using the ceph-disk utility). It is observed that a few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck in the 'init creating/touching snapmapper object' phase. Below is an OSD start-up log snippet:

    2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
    2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
    2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
    2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object

The log statement is inaccurate though, since it is actually doing the init operation for the 'infos' object (as can be observed from the source [2]). Upon debugging further, the thread seems to be waiting to acquire the 'ObjectStore::apply_transaction::my_lock' mutex.
Below is the debug trace:

(gdb) where
#0 0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
#2 0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
#3 0x7fd313076790 in OSD::init() ()
#4 0x7fd3130233a7 in main ()

In a few cases, upon restarting the stuck OSD service, it successfully completes the 'init' phase and reaches the 'up' and 'in' state! Any help is greatly appreciated. Please let me know if any more details are required for root-causing.

[1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
[2] - https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211

Regards, Unmesh G. IRC: unmeshg

-- Best Regards, Wheat
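For readers following the trace: frame #1 is a block-until-commit pattern. A minimal C++ sketch of that pattern (illustrative only, not Ceph's actual code) is below; if the journal/commit thread never calls complete(), the caller stays in pthread_cond_wait forever, exactly as in the trace above:

  #include <condition_variable>
  #include <mutex>

  // The apply thread queues a transaction with a completion callback,
  // then blocks on a condition variable until the backend signals commit.
  struct CommitWaiter {
    std::mutex lock;                 // plays the role of 'my_lock'
    std::condition_variable cond;
    bool done = false;

    void wait() {                    // called by the init/apply thread
      std::unique_lock<std::mutex> l(lock);
      while (!done)                  // guards against spurious wakeups
        cond.wait(l);                // <-- pthread_cond_wait in frame #0
    }

    void complete() {                // called by the journal/commit thread
      std::lock_guard<std::mutex> l(lock);
      done = true;
      cond.notify_all();
    }
  };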
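And for anyone reproducing the debugging suggestions from this thread, the proposed log levels would go into ceph.conf roughly as follows (a sketch; placing them in the [osd] section is the usual convention):

  [osd]
  debug osd = 20/20
  debug filestore = 20/20
  debug journal = 20/20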
Re: Erasure Code Plugins : PLUGINS_V3 feature
Hi Takeshi, https://github.com/ceph/ceph/pull/5493 is ready for your review. The matching integration tests can be found at https://github.com/ceph/ceph-qa-suite/pull/523 Cheers

On 06/08/2015 02:28, Miyamae, Takeshi wrote:

Dear Sage,

> note that what this really means is that the on-disk encoding needs to remain fixed.

Thank you for this important notice. We have no plans to change shec's format at the moment, but we will keep this in mind for the future.

Best Regards, Takeshi Miyamae

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Thursday, August 6, 2015 3:45 AM
To: Loic Dachary; Miyamae, Takeshi/宮前 剛
Cc: Samuel Just; Ceph Development
Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature

On Wed, 5 Aug 2015, Loic Dachary wrote: Hi Sam, How does this proposal sound? It would be great if that was done before the feature freeze.

I think it's a good time. Takeshi, note that what this really means is that the on-disk encoding needs to remain fixed. If we decide to change it down the line, we'll have to make a 'shec2' or similar so that the old format is still decodable (or ensure that existing data can still be read in some other way). Sound good? sage

On 29/07/2015 11:16, Loic Dachary wrote:

Hi Sam, The SHEC plugin [0] has been running in the rados runs [1] for the past few months. It also has a matching corpus verification, covering its optimized variants as well, which runs on every make check [2]. I believe the experimental flag can now be removed. In order to do so, we need to use a PLUGINS_V3 feature, in the same way we did back in Giant when the ISA and LRC plugins were introduced [3]. This won't be necessary in the future, when there is a generic plugin mechanism, but right now that's what we need. It would be a commit very similar to the one implementing PLUGINS_V2 [4]. Is this agreeable to you? Or would you rather see another way to resolve this? Cheers

[0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec
[1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec
[2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52
[3] http://tracker.ceph.com/issues/9343
[4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489

-- Loïc Dachary, Artisan Logiciel Libre
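For readers unfamiliar with the PLUGINS_V2/V3 mechanism discussed here: Ceph gates incompatible changes behind feature bits exchanged between peers. A rough C++ sketch of the idea follows (names and the bit position are illustrative, not the real definitions from src/include/ceph_features.h):

  #include <cstdint>

  // Each feature is one bit in a 64-bit mask that peers exchange; a
  // daemon missing a required bit is refused, which is how the cluster
  // avoids handing shec-encoded chunks to OSDs that cannot decode them.
  const uint64_t FEATURE_EC_PLUGINS_V3 = 1ULL << 40;  // bit value hypothetical

  inline bool peer_can_use_shec(uint64_t peer_features) {
    return (peer_features & FEATURE_EC_PLUGINS_V3) != 0;
  }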
testing the teuthology OpenStack backend
Hi, I'm looking into testing the OpenStack backend for teuthology on a new cluster to verify it's portable. I think it is, but ... ;-) I'm told you have an OpenStack cluster and would be interested in running teuthology workloads on it. Does it have a public-facing API? Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re:Re: Consult some problems of Ceph when reading source code
Dear Dr. Sage: Thank you for your detailed reply! These answers help me a lot. I still have some questions about Question (1).

In your reply, requests are enqueued into the ShardedWQ according to their PG. If I have 3 requests (say (pg1,r1), (pg2,r2), (pg3,r3)) and I put them into the ShardedWQ, is their processing also serialized?

When I want to dequeue an item from the ShardedWQ, there is a work_queues member (a vector of work queues) in ThreadPool (WorkQueue.cc), and the worker picks a work queue from work_queues. So are there many work queues involved in processing a request, or do they have no association with the ShardedWQ?

When I get an item from the ShardedWQ, I will turn it into a transaction and then read or write. Is this done one by one (another transaction handled only when the current one is finished)? If so, how can we guarantee performance? If not, are the transactions' actions parallel?

Thank you a lot!

At 2015-08-06 20:44:45, Sage Weil s...@newdream.net wrote:

Hi! On Thu, 6 Aug 2015, Cai Yi wrote:

Dear developers, My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an Jiaotong University in China. From Ceph's homepage, I know Sage is the author of Ceph, and I got the email address from your GitHub and Ceph's official website. Because Ceph is an excellent distributed file system, I have recently been reading the source code of Ceph (the version is Hammer) to understand the I/O path and the performance of Ceph. However, I face some problems for which I could not find a solution on the Internet or solve by myself and with my partners. So I was wondering if you could help us with some problems. The problems are as follows:

1) In Ceph there is the concept of a transaction. When the OSD receives a write request, it is encapsulated in a transaction. But when the OSD receives many requests, is there a transaction queue to receive the messages? If there is a queue, are these transactions submitted serially or in parallel to the next stage? If it is serial, could the transaction operations influence performance?

The requests are distributed across placement groups and into a shared work queue, implemented by ShardedWQ in common/WorkQueue.h. This serializes processing for a given PG, but this generally makes little difference as there are typically 100 or more PGs per OSD.

2) From some documents about Ceph, if the OSD receives a read request, the OSD can only read data from the primary and then reply to the client. Is that description right?

Yes. This is usually the right thing to do, or else a given object will end up consuming cache (memory) on more than one OSD and the overall cache efficiency of the cluster will drop by your replication factor. It's only a win to distribute reads when you have a very hot object, or when you want to spend OSD resources to reduce latency (e.g., by sending reads to all replicas and taking the fastest reply).

Is there any way to read the data from a replica OSD? Do we have to request the data from the primary OSD when dealing with a read request? If not, and we can read from a replica OSD, can we still guarantee consistency?

There is a client-side flag to read from a random or the closest replica, but there are a few bugs that affect consistency when recovery is underway that are being fixed up now. It is likely that this will work correctly in Infernalis, the next stable release.

3) When the OSD receives a message, the message's attribute may be normal dispatch or fast dispatch.
What is the difference between normal dispatch and fast dispatch? If the attribute is normal dispatch, the message enters the dispatch queue. Is there a single dispatch queue or multiple dispatch queues to deal with all the messages?

There is a single thread that does the normal dispatch. Fast dispatch processes the message synchronously from the thread that received the message, so it is faster, but it has to be careful not to block.

These are the problems I am facing. Thank you for your patience and cooperation, and I look forward to hearing from you.

Hope that helps! sage
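A rough illustration of the distinction Sage describes (a C++ sketch, not Ceph's actual Messenger code; all names are hypothetical): normal dispatch hands the message to a queue drained by a single dispatch thread, while fast dispatch invokes the handler synchronously on the network thread that received it, which is why fast-dispatch handlers must never block.

  #include <condition_variable>
  #include <functional>
  #include <mutex>
  #include <queue>

  struct Message { int type; };

  class DispatcherSketch {
    std::queue<Message> dq;        // the single normal-dispatch queue
    std::mutex m;
    std::condition_variable cv;
  public:
    std::function<void(const Message&)> handler;

    // Normal dispatch: enqueue and return; one dedicated thread drains it.
    void queue_message(const Message &msg) {
      { std::lock_guard<std::mutex> l(m); dq.push(msg); }
      cv.notify_one();
    }

    // Runs in the single dispatch thread; handlers here may block safely.
    void dispatch_loop() {
      for (;;) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this]{ return !dq.empty(); });
        Message msg = dq.front();
        dq.pop();
        l.unlock();
        handler(msg);
      }
    }

    // Fast dispatch: called directly from the network reader thread.
    // Lower latency (no queue hop), but the handler must not block.
    void fast_dispatch(const Message &msg) { handler(msg); }
  };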
Re: radosgw + civetweb latency issue on Hammer
Hi Srikanth, can you make a ticket on tracker.ceph.com for this? We'd like to not lose track of it. Thanks! Mark

On 08/05/2015 07:01 PM, Srikanth Madugundi wrote:

Hi, after upgrading to Hammer and moving from apache to civetweb, we started seeing high PUT latency, on the order of 2 sec for every PUT request. GET requests, however, do not show this latency. Attaching the radosgw logs for a single request. The ceph.conf has the following configuration for civetweb:

[client.radosgw.gateway]
rgw frontends = civetweb port=5632

Further investigation revealed that the call to get_data() at https://github.com/ceph/ceph/blob/hammer/src/rgw/rgw_op.cc#L1786 is taking 2 sec to return. The cluster is running the Hammer 0.94.2 release. Did anyone face this issue before? Is there some configuration I am missing?

Regards, Srikanth
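One generic way to confirm where the time is spent (a hedged C++ sketch, not radosgw code; timed() is a hypothetical helper, and the get_data() mentioned in the usage comment simply stands in for whatever call is under suspicion) is to bracket the suspect call with a steady clock:

  #include <chrono>
  #include <iostream>

  // Hypothetical helper: run a callable, log how long it took, and
  // return its result, e.g.
  //   auto r = timed("get_data", [&]{ return op->get_data(); });
  template <typename F>
  auto timed(const char *label, F &&f) {
    auto t0 = std::chrono::steady_clock::now();
    auto result = f();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
    std::cerr << label << " took " << ms << " ms" << std::endl;
    return result;
  }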