Re:Re: Consult some problems of Ceph when reading source code

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, ?? wrote:
 Dear Dr.Sage:

 Thank you for your detailed reply. These answers help me a lot. I also 
 have some problems with Question (1).

 In your reply, requests are enqueued into the ShardedWQ according to 
 their PG. If I have 3 requests (that is, (pg1,r1), (pg2,r2), (pg3,r3)) 
 and I put them into the ShardedWQ, is the processing also serialized?

Lots of threads are enqueuing things into the ShardedWQ.  A deterministic 
function of the pg determines which shard the request lands in.

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8247

 When I want to dequeue the item from ShardedWQ, there is a work_queues 
 (the type is the vector of work_queue) in ThreadPool 
 method(WorkQueue.cc) and then I calculate the work queue according to 
 the work_queues, so is there many work queue in the request process?  
 or is there no association with the ShardedWQ?

https://github.com/ceph/ceph/blob/master/src/common/WorkQueue.cc#L350

Any given thread services a single shard.  There can be more than 
one thread per shard.  There's a bunch of code in OSD.cc that ensures 
that the requests for any given PG are processed in order, serially, 
so if two threads pull off requests for the same PG one will block so 
that they still complete in order.
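To illustrate the idea, here is a small self-contained sketch (not the actual
OSD code; the class and names below are made up) of how a deterministic
PG-to-shard mapping plus a per-shard FIFO keeps requests for one PG in order:

#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

struct Request { uint64_t pg_id; std::string payload; };

class ShardedQueueSketch {
  struct Shard { std::mutex lock; std::deque<Request> q; };
  std::vector<Shard> shards;
public:
  explicit ShardedQueueSketch(size_t n) : shards(n) {}
  // Deterministic function of the PG chooses the shard.
  size_t shard_of(uint64_t pg_id) const {
    return std::hash<uint64_t>{}(pg_id) % shards.size();
  }
  void enqueue(const Request& r) {
    Shard& s = shards[shard_of(r.pg_id)];
    std::lock_guard<std::mutex> g(s.lock);
    s.q.push_back(r);                 // same PG -> same shard -> FIFO order kept
  }
  bool dequeue(size_t shard, Request* out) {
    Shard& s = shards[shard];
    std::lock_guard<std::mutex> g(s.lock);
    if (s.q.empty()) return false;
    *out = s.q.front();
    s.q.pop_front();
    return true;
  }
};

int main() {
  ShardedQueueSketch wq(5);
  wq.enqueue({1, "r1"}); wq.enqueue({2, "r2"}); wq.enqueue({1, "r3"});
  Request r;
  // Drains the shard holding pg 1: its requests come out in submission order.
  while (wq.dequeue(wq.shard_of(1), &r))
    std::cout << r.payload << "\n";
}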

 When I get the item from the ShardedWQ, I will turn it into a 
 transaction and then read or write. Is the process handled one by one 
 (another transaction is handled only when this transaction is over)? If 
 it is, can we still guarantee performance? If it isn't, are the 
 transactions' actions parallel?

The write operations are analyzed, prepared, and then started (queued for 
disk and replicated over the network).  Completion is asynchronous (since 
it can take a while).

The read operations are currently done synchronously (we block while we 
read the data from the local copy on disk), although this is likely to 
change soon to be either synchronous or async (depending on the backend, 
hardware, etc.).
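To make the write/read split concrete, here is a tiny self-contained sketch
(illustrative only, not Ceph code): the write path queues the work and fires a
completion callback later, while the read path blocks the caller.

#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <thread>

// "Write": start the work, return immediately, invoke the callback when done.
std::future<void> submit_write(std::string data, std::function<void()> on_commit) {
  return std::async(std::launch::async, [data, on_commit] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50)); // journal + replicate
    on_commit();                                                // ack sent asynchronously
  });
}

// "Read": block the calling thread until the local copy has been read.
std::string read_sync(const std::string& object) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));   // disk read
  return "data-of-" + object;
}

int main() {
  auto f = submit_write("payload", [] { std::cout << "write committed\n"; });
  std::cout << read_sync("obj1") << "\n";  // caller blocks here
  f.wait();                                // write completion arrives later
}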

HTH!
sage


 Thank you a lot!
 
 
 
 
 
 At 2015-08-06 20:44:45, Sage Weil s...@newdream.net wrote:
 Hi!
 
 On Thu, 6 Aug 2015, ?? wrote:
  Dear developers,
  
  My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an 
  Jiaotong University in China. From Ceph's homepage, I know Sage is the 
  author of Ceph, and I got the email address from your GitHub and Ceph's 
  official website. Because Ceph is an excellent distributed file system, 
  I have recently been reading the source code of Ceph (the Hammer 
  edition) to understand the good IO path and the performance of Ceph. 
  However, I face some problems for which I could not find a solution on 
  the Internet or solve by myself and my partners. So I was wondering if 
  you could help us solve some problems. The problems are as follows:
  
  1)  In Ceph there is the concept of a transaction. When the OSD 
  receives a write request, it is encapsulated in a transaction. But when 
  the OSD receives many requests, is there a transaction queue to receive 
  the messages? If there is a queue, are these transactions submitted to 
  the next stage serially or in parallel? If serially, could the 
  transaction operations hurt performance?
 
 The requests are distributed across placement groups and into a shared 
 work queue, implemented by ShardedWQ in common/WorkQueue.h.  This 
 serializes processing for a given PG, but this generally makes little 
 difference as there are typically 100 or more PGs per OSD.
 
  2)  From some documents about Ceph, if the OSD receives a read request, 
  the OSD can only read the data from the primary and then return it to 
  the client. Is that description right?
 
 Yes.  This is usually the right thing to do or else a given object will 
 end up consuming cache (memory) on more than one OSD and the overall cache 
efficiency of the cluster will drop by your replication factor.  It's only 
a win to distribute reads when you have a very hot object, or when you 
want to spend OSD resources to reduce latency (e.g., by sending reads to 
all replicas and taking the fastest reply).
 
  Is there any way to read the data from a replica 
  OSD? Do we have to request the data from the primary OSD when dealing 
  with a read request? If not, and we can read from a replica OSD, can we 
  still guarantee consistency?
 
 There is a client-side flag to read from a random or the closest 
 replica, but there are a few bugs that affect consistency when recovery is 
 underway that are being fixed up now.  It is likely that this will work 
 correctly in Infernalis, the next stable release.
 
  3)  When the OSD receives the message, the message's attribute may be 
  normal dispatch or fast dispatch. What is the difference between 
  normal dispatch and fast dispatch? If the attribute is normal 
  dispatch, it enters the dispatch queue. Is there a single dispatch 
  queue or are there multiple dispatch queues to deal with all the messages?

Newbie question about metadata_list.

2015-08-06 Thread Łukasz Szymczyk
Hi,

I'm writing a program to replace an image in the cluster with its copy.
But I have a problem with metadata_list.
I created pool:
#rados mkpool dupa
then I created image:
#rbd create --size 1000 -p mypool image --image-format 2

Below is the code which tries to get the metadata, but it fails with -EOPNOTSUPP.
I compile it with a command like this:
g++ file.cpp -lrbd -lrados -I/ceph/source/directory/src

#include <stdio.h>
#include <stdlib.h>
#include "include/rbd/librbd.h"
#include "include/rbd/librbd.hpp"
using namespace ceph;
#include "librbd/ImageCtx.h"

int main() {
rados_t clu;
int ret = rados_create(&clu, NULL);
if (ret) return -1;

ret = rados_conf_read_file(clu, NULL);
if (ret) return -1;

rados_conf_parse_env(clu, NULL);
ret = rados_connect(clu);
if (ret) return -1;

rados_ioctx_t io;
ret = rados_ioctx_create(clu, "mypool", &io);
if (ret) return -1;

rbd_image_t im;
ret = rbd_open(io, "image", &im, NULL);
if (ret) return -1;

librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
std::string start;
int max = 1000;
bufferlist in, out;
::encode(start, in);
::encode(max, in);
ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd",
"metadata_list", in, out);
if (ret < 0) printf("fail\n");

return 0;

}

So, my question is: what should be set/enabled to get that metadata?
Or maybe tell me what I'm doing wrong here.
Thanks for your help.

Best regards,
Łukasz


Re: civetweb health check

2015-08-06 Thread Wido den Hollander


On 05-08-15 18:37, Srikanth Madugundi wrote:
 Hi,
 
 We are planning to move our radosgw setup from apache to civetweb. We
 were successfully able to setup and run civetweb on a test cluster.
 
 The radosgw instances are fronted by a VIP which currently checks the
 health by getting the /status.html file. After moving to civetweb the VIP
 is unable to get the health of the radosgw server using the /status.html
 endpoint and assumes the server is down.
 
 I looked at ceph radosgw documentation and did not find any
 configuration to rewrite urls. What is the best approach for VIP to
 get the health of radosgw?
 

You can simply query /

This is what I use in Varnish to do a health check:

backend rgw {
.host   = 127.0.0.1;
.port   = 7480;
.connect_timeout= 1s;
.probe = {
.timeout   = 30s;
.interval  = 3s;
.window= 10;
.threshold = 3;
.request =
"GET / HTTP/1.1"
"Host: localhost"
"User-Agent: Varnish-health-check"
"Connection: close";
}
}

Works fine, RGW will respond with a 200 OK in /
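For what it's worth, here is a minimal sketch of what such a probe does on the
wire (plain POSIX sockets, illustrative only; it assumes civetweb is listening
on 127.0.0.1:7480 as in the config above):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <iostream>
#include <string>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return 1;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(7480);                       // civetweb port from the config
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) return 1;

  const std::string req =
      "GET / HTTP/1.1\r\nHost: localhost\r\n"
      "User-Agent: health-check\r\nConnection: close\r\n\r\n";
  send(fd, req.data(), req.size(), 0);

  char buf[4096];
  ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
  close(fd);
  if (n <= 0) return 1;
  buf[n] = '\0';
  // A 200 OK on "/" is what the VIP treats as "radosgw is up".
  bool healthy = std::strstr(buf, "200 OK") != nullptr;
  std::cout << (healthy ? "rgw healthy" : "rgw unhealthy") << "\n";
  return healthy ? 0 : 1;
}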

Wido

 Thanks
 Srikanth


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Yan, Zheng
On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote:
 Today I learned that syncfs(2) does an O(n) search of the superblock's
 inode list searching for dirty items.  I've always assumed that it was
 only traversing dirty inodes (e.g., a list of dirty inodes), but that
 appears not to be the case, even on the latest kernels.


I checked syncfs code in 3.10/4.1 kernel. I think both kernels only
traverse dirty inodes (inodes in
bdi_writeback::{b_dirty,b_io,b_more_io} lists). what am I missing?


 That means that the more RAM in the box, the larger (generally) the inode
 cache, the longer syncfs(2) will take, and the more CPU you'll waste doing
 it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40
 servicing a very light workload, and each syncfs(2) call was taking ~7
 seconds (usually to write out a single inode).

 A possible workaround for such boxes is to turn
 /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching
 pages instead of inodes/dentries)...

 I think the take-away though is that we do need to bite the bullet and
 make FileStore f[data]sync all the right things so that the syncfs call
 can be avoided.  This is the path you were originally headed down,
 Somnath, and I think it's the right one.

 The main thing to watch out for is that according to POSIX you really need
 to fsync directories.  With XFS that isn't the case since all metadata
 operations are going into the journal and that's fully ordered, but we
 don't want to allow data loss on e.g. ext4 (we need to check what the
 metadata ordering behavior is there) or other file systems.

 :(

 sage


Re: More ondisk_finisher thread?

2015-08-06 Thread Ding Dinghua
Sorry for the noise.
I have found the cause in our setup and case: we gathered too many
logs in our RADOS IO path, and the latency seems to be
reasonable (about 0.026 ms) if we don't gather that many logs...

2015-08-05 20:29 GMT+08:00 Sage Weil s...@newdream.net:
 On Wed, 5 Aug 2015, Ding Dinghua wrote:
 2015-08-05 0:13 GMT+08:00 Somnath Roy somnath@sandisk.com:
  Yes, it has to re-acquire pg_lock today..
  But, between journal write and initiating the ondisk ack, there is one 
  context switch in the code path. So, I guess the pg_lock is not the only 
  one that is causing this 1 ms delay...
  Not sure increasing the finisher threads will help in the pg_lock case as 
  it will be more or less serialized by this pg_lock..
 My concern is, if pg lock of pg A has been grabbed, not only ondisk
 callback of pg A is delayed, since ondisk_finisher has only one
 thread,  ondisk callback of other pgs will be delayed too.

 I wonder if an optimistic approach might help here by making the
 completion synchronous and doing something like

if (pg->lock.TryLock()) {
   pg->_finish_thing(completion->op);
   delete completion;
} else {
   finisher.queue(completion);
}

 or whatever.  We'd need to ensure that we aren't holding any lock or
 throttle budget that the pg could deadlock against.

 sage



-- 
Ding Dinghua


Re: wip-user status

2015-08-06 Thread Sage Weil
On Wed, 5 Aug 2015, Milan Broz wrote:
 On 08/04/2015 10:53 PM, Sage Weil wrote:
  I rebased the wip-user patches from wip-selinux-policy onto 
  wip-selinux-policy-no-user + merge to master so that it sits on top of the 
  newly-merged systemd changes.
 
 Great, so if it is build-ready state, I can try it with our virtual 
 cluster install.
 
  Notes/issues:
  
   - ceph-osd-prestart.sh verifies that the osd_data dir is owned by either 
  'root' or 'ceph' or else it exits with an error.  (Presumably systemd will 
  fail to start the unit in this case.)  It prints a helpful message 
  pointing the user at 'ceph-disk chown ...'.
  
   - 'ceph-disk chown ...' is not implemented yet.  Should it take the base 
  device, like activate and prepare?  Or a mounted path?  Or either?
 
 It should be easy to convert device/mountpoint by using findmnt so I would
 prefer what is more consistent with the user interface...
 
 IIRC, if the parameter is a base device, what should happen if device is not 
 mounted?
 If mount path - then what about other data/journal partitions?
 
 It seems to me that parameter could be base OSD device and chown will 
 simply handle all its partitions. (So for encrypted OSD it needs to get 
 key to unlock it etc...)

This sounds like the cleanest approach to me too.

   - Currently ceph-osd@.service unconditionally passes --setuser ceph to 
  ceph-osd... even if the data directory is owned by root.  I don't think 
  systemd is smart enough to do this conditionally unless we make an ugly 
  wrapper script that starts ceph-osd.  Alternatively, we could make 
  ceph-osd conditionally do the setuid based on the ownership of the 
  directory, but... meh.  The idea was to do the setuid *very* early in the 
  startup process so that logging and so on are opened as the ceph user.  
  Ideas?
 
 Well, systemd could do that if the service is generated (like e.g. cryptsetup
 activation jobs are generated according to crypttab). But this adds complexity
 that we do not need...
 
 Maybe another option is to use environment variable (CEPH_USER or so), set it
 in service Environment=/EnvironmentFile... and ceph-osd will use that...
 
 But I think some systemd gurus will find something better here:)

Take a look at

https://github.com/ceph/ceph/pull/5494

The idea is to just make the setuid in the daemon conditional on a path in 
the file system matching the uid/gid.  If they match, we drop privs.  If 
they don't, we print a warning and remain root.
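For illustration, a rough sketch of that check (not the code in PR 5494; the
helper name and data-dir path below are hypothetical):

#include <sys/stat.h>
#include <sys/types.h>
#include <grp.h>
#include <pwd.h>
#include <unistd.h>
#include <cstdio>

bool maybe_drop_privs(const char* data_dir, const char* user, const char* group) {
  struct passwd* pw = getpwnam(user);
  struct group*  gr = getgrnam(group);
  struct stat st;
  if (!pw || !gr || stat(data_dir, &st) != 0)
    return false;
  if (st.st_uid != pw->pw_uid || st.st_gid != gr->gr_gid) {
    fprintf(stderr, "warning: %s not owned by %s:%s, staying root\n",
            data_dir, user, group);
    return false;                      // ownership mismatch: keep running as root
  }
  // Order matters: drop the group first, then the user.
  if (setgid(gr->gr_gid) != 0 || setuid(pw->pw_uid) != 0)
    return false;
  return true;                         // now running as the unprivileged user
}

int main() {
  // Hypothetical path; a daemon would use its own data directory here.
  maybe_drop_privs("/var/lib/ceph/osd/ceph-0", "ceph", "ceph");
  return 0;
}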

This doesn't handle the case where the daemon data dir is owned by 
something other than ceph or root.  It will work just fine (the 
daemon will run as root), but perhaps we want to fail in that case?  The 
OSD has an explicit check for this in ceph-osd-prestart.sh asking the 
admin to run 'ceph-disk chown', but the other daemons don't have prestarts.  
They also generally won't have mismatched ownership because they generally 
won't get swapped around between hosts...

sage


Re: civetweb health check

2015-08-06 Thread Srikanth Madugundi
Hitting the '/' endpoint worked.

Thanks
Srikanth

On Thu, Aug 6, 2015 at 1:26 AM, Wido den Hollander w...@42on.com wrote:


 On 05-08-15 18:37, Srikanth Madugundi wrote:
 Hi,

 We are planning to move our radosgw setup from apache to civetweb. We
 were successfully able to setup and run civetweb on a test cluster.

 The radosgw instances are fronted by a VIP which currently checks the
 health by getting the /status.html file. After moving to civetweb the VIP
 is unable to get the health of the radosgw server using the /status.html
 endpoint and assumes the server is down.

 I looked at ceph radosgw documentation and did not find any
 configuration to rewrite urls. What is the best approach for VIP to
 get the health of radosgw?


 You can simply query /

 This is what I use in Varnish to do a health check:

 backend rgw {
 .host   = 127.0.0.1;
 .port   = 7480;
 .connect_timeout= 1s;
 .probe = {
 .timeout   = 30s;
 .interval  = 3s;
 .window= 10;
 .threshold = 3;
 .request =
 "GET / HTTP/1.1"
 "Host: localhost"
 "User-Agent: Varnish-health-check"
 "Connection: close";
 }
 }

 Works fine, RGW will respond with a 200 OK in /

 Wido

 Thanks
 Srikanth


RE: Erasure Code Plugins : PLUGINS_V3 feature

2015-08-06 Thread Miyamae, Takeshi
Hi Loic,

Thank you for arranging PLUGINS_V3 feature.
I had just started to review pull request #5493. Please wait just a moment.

By the way, may I ask what the current status of #5257 (decoding cache:
the last immediate request from SHEC) is?
https://github.com/ceph/ceph/pull/5257
Please tell us if we need to rebase again.

Best regards,
Takeshi Miyamae

-Original Message-
From: Loic Dachary [mailto:l...@dachary.org] 
Sent: Thursday, August 6, 2015 10:58 PM
To: Miyamae, Takeshi/宮前 剛
Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature

Hi Takeshi,

https://github.com/ceph/ceph/pull/5493 is ready for your review. The matching 
integration tests can be found at https://github.com/ceph/ceph-qa-suite/pull/523

Cheers

On 06/08/2015 02:28, Miyamae, Takeshi wrote:
 Dear Sage,
 
 note that what this really means is that the on-disk encoding needs to 
 remain fixed.
 
 Thank you for letting us know the important notice.
 We have no plan to change shec's format at this moment, but we will 
 remember the comment for any future events.
 
 Best Regards,
 Takeshi Miyamae
 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Thursday, August 6, 2015 3:45 AM
 To: Loic Dachary; Miyamae, Takeshi/宮前 剛
 Cc: Samuel Just; Ceph Development
 Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature
 
 On Wed, 5 Aug 2015, Loic Dachary wrote:
 Hi Sam,

 How does this proposal sound ? It would be great if that was done 
 before the feature freeze.
 
 I think it's a good time.
 
 Takeshi, note that what this really means is that the on-disk encoding needs 
 to remain fixed.  If we decide to change it down the line, we'll have to make 
 a 'shec2' or similar so that the old format is still decodable (or ensure 
 that existing data can still be read in some other way).
 
 Sound good?
 
 sage
 
 

 Cheers

 On 29/07/2015 11:16, Loic Dachary wrote:
 Hi Sam,

 The SHEC plugin[0] has been running in the rados runs[1] in the past few 
 months. It also has a matching corpus verification which runs on every make 
 check[2] as well as its optimized variants. I believe the flag 
 experimental can now be removed. 

 In order to do so, we need to use a PLUGINS_V3 feature, in the same way we 
 did back in Giant when the ISA and LRC plugins were introduced[3]. This 
 won't be necessary in the future, when there is a generic plugin mechanism, 
 but right now that's what we need. It would be a commit very similar to the 
 one implementing PLUGINS_V2[4].

 Is this agreeable to you ? Or would you rather see another way to resolve 
 this ?

 Cheers

  [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec
  [1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec
  [2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52
  [3] http://tracker.ceph.com/issues/9343
  [4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489


 --
 Loïc Dachary, Artisan Logiciel Libre



-- 
Loïc Dachary, Artisan Logiciel Libre


Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-06 Thread Jevon Qiao

Hi Jan,

Thank you very much for the suggestion.

Regards,
Jevon
On 5/8/15 19:36, Jan Schermer wrote:

Hi,
comments inline.


On 05 Aug 2015, at 05:45, Jevon Qiao qiaojianf...@unitedstack.com wrote:

Hi Jan,

Thank you for the detailed suggestion. Please see my reply in-line.
On 5/8/15 01:23, Jan Schermer wrote:

I think I wrote about my experience with this about 3 months ago, including 
what techniques I used to minimize impact on production.

Basically we had to
1) increase pg_num in small increments only, because creating the placement groups 
themselves caused slow requests on OSDs
2) increase pgp_num in small increments and then go higher

So you completely finished step 1 before jumping into step 2. Have you ever 
tried mixing them together? Increase pg_num, increase pgp_num, increase 
pg_num…

Actually we first increased both to 8192 and then decided to go higher, but 
that doesn’t matter.
The only reason for this was that the first step could run unattended at 
night without disturbing the workload.*
The second step had to be attended.

* in other words, we didn’t see “slow requests” because of our threshold 
settings, but while PGs were creating the cluster paused IO for non-trivial 
amounts of time. I suggest you do this in as small steps as possible, depending 
on your SLAs.


We went from 4096 placement groups up to 16384

pg_num (the number of on-disk created placement groups) was increased like this:
# for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 
60 ; done
this ran overnight (and was upped to 128 step during the night)

Increasing pgp_num was trickier in our case, first because it was heavy 
production and we wanted to minimize the visible impact and second because of 
wildly differing free space on the OSDs.
We did it again in steps and waited for the cluster to settle before continuing.
Each step upped pgp_num by about 2% and as we got higher (8192) we increased this to 
much more - the last step was 15360-16384 with the same impact the initial 
4096-4160 had.

The strategy you adopted looks great. I'll do some experiments on a test 
cluster to evaluate the real impact in each step

The end result is much better but still nowhere near optimal - bigger impact 
would be upgrading to a newer Ceph release and setting the new tunables because 
we’re running Dumpling.

Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), 
and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it 
only had about 1GB before. That’s a lot of memory and space with higher OSD 
counts...

This is a good point. So along with the increment of PGs, we also need to take 
the current status of the cluster(the available disk space and memory for each 
OSD) into account and evaluate whether it is needed to add more resources.

Depends on how much free space you have. We had some OSDs at close to 85% 
capacity before we started (and other OSD’s at only 30%). When increasing the 
number of PGs the data shuffled greatly - but this depends on what CRUSH rules 
you have (and what version you are running). Newer versions with newer tunables 
will make this a lot easier I guess.


And while I haven’t calculated the number of _objects_ per PG, we do have 
differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another 
hosts 1300) and this seems to be the cause of poor data balancing.

In our environment, we also encountered an imbalanced mapping between PGs and 
OSDs. What kind of bucket algorithm was used in your environment? Any idea on 
how to minimize it?

We are using straw because of dumpling. Straw2 should make everything better :-)

Jan


Thanks,
Jevon

Jan



On 04 Aug 2015, at 18:52, Marek Dohojda mdoho...@altitudedigital.com wrote:

I have done this not that long ago.  My original PG estimates were wrong and I 
had to increase them.

After increasing the PG numbers the Ceph rebalanced, and that took a while.  To 
be honest in my case the slowdown wasn’t really visible, but it took a while.

My strong suggestion to you would be to do it during a long window of low IO, and be 
prepared that this will take quite a long time to accomplish.  Do it slowly 
and do not increase multiple pools at once.

It isn’t recommended practice but doable.




On Aug 4, 2015, at 10:46 AM, Samuel Just sj...@redhat.com wrote:

It will cause a large amount of data movement.  Each new pg after the
split will relocate.  It might be ok if you do it slowly.  Experiment
on a test cluster.
-Sam

On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote:

Hi Cephers,

This is a greeting from Jevon. Currently, I'm experiencing an issue which
troubles me a lot, so I'm writing to ask for your comments/help/suggestions.
More details are provided below.

Issue:
I set up a cluster having 24 OSDs and created one pool with 1024 placement
groups on it for a small startup company. The number 1024 was calculated per
the equation 'OSDs * 100'/pool size. The cluster 

Re: About the Ceph erasure pool with ISA plugin on Intel xeon CPU

2015-08-06 Thread Derek Su
Hello, Loic
The following are my steps and configurations:
(1) The 11 OSDs and 3 monitors were run in Docker containers on the
same host machine.
(2) Each OSD had one 1 TB HDD.

(3) I set the erasure coding pool profiles:
## Jerasure, Reed-Solomon
$ ceph osd erasure-code-profile set reed_k4m2_A k=4 m=2
directory=/usr/lib64/ceph/erasure-code

## ISA, Reed-Solomon
$ ceph osd erasure-code-profile set reed_k4m2_isa_A k=4 m=2
directory=/usr/lib64/ceph/erasure-code plugin=isa
technique=reed_sol_van

(4) Then, the erasure pools were created:
## Jerasure, Reed-Solomon
$ ceph osd pool create reed_k4m2_A_pool 128 128 erasure reed_k4m2_A

## ISA, Reed-Solomon
$ ceph osd pool create reed_k4m2_isa_A_pool 128 128 erasure reed_k4m2_isa_A

(5) Then, I used the rados benchmark to test the write performance:
## Jerasure, Reed-Solomon
$ rados bench -p reed_k4m2_A_pool 500 write --no-cleanup

## ISA, Reed-Solomon
$ rados bench -p reed_k4m2_isa_A_pool 500 write --no-cleanup


The results:
(1) Jerasure/Reed-Solomon
Write throughput: 136.0 MB/s, Latency: 0.471 s
(2) ISA/Reed-Solomon
Write throughput: 133.1 MB/s, Latency: 0.481 s
(3) Jerasure/cauchy_good
Write throughput: 138.3 MB/s, Latency: 0.462 s
(4) ISA/cauchy
Write throughput: 140.2 MB/s, Latency: 0.452 s

--
My CPU information:
Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz

$ cat /proc/cpuinfo | grep flags
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid
sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi
flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms
invpcid xsaveopt

RAM: 12 GiB


The results of the performance tests suggest there is no difference...

Thanks, :)
Derek

2015-08-06 20:31 GMT+08:00 Loic Dachary l...@dachary.org:
 Hi,

 Could you please publish the benchmark results somewhere ? I should be able 
 to figure out why you don't see a difference.

 Cheers

 On 06/08/2015 13:25, Derek Su wrote:
 Dear Mr. Dachary and all,

 Recently, I found your blog showing the performance tests of erasure
 pools (http://dachary.org/?p=3042 , http://dachary.org/?p=3665).
 The results indicate that the write throughput can be enhanced
 significantly using Intel Xeon CPUs.

 I tried to create an erasure pool with isa plugin, reed_sol_van
 technique, and k/m=4/2 on the Intel(R) Xeon(R) CPU E3-1245 v3 @
 3.40GHz machines.

 However, the results of the rados benchmark showed that there was no
 difference between the jerasure and isa plugins. It seems very
 strange.

 Do I need to do any other configuration in addition to setting the
 erasure profile?
 In addition, how can I know whether the erasure pool is actually
 accelerated by the ISA plugin? Is there any command I can use?

 Thanks, :)

 Derek Su.


 --
 Loïc Dachary, Artisan Logiciel Libre



RE: OSD sometimes stuck in init phase

2015-08-06 Thread Gurjar, Unmesh
Thanks for the quick response, Haomai! Please find the backtrace here [1].

[1] - http://paste.openstack.org/show/411139/

Regards,
Unmesh G.
IRC: unmeshg

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 5:31 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase
 
  Could you print all your thread call stacks via 'thread apply all bt'?
 
 On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com
 wrote:
  Hi,
 
  On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate
 data and journal disks (using the ceph-disk utility). It is observed, that 
 few OSDs
 start-up fine (are 'up' and 'in' state); however, others are stuck in the 
 'init
 creating/touching snapmapper object' phase. Below is a OSD start-up log
 snippet:
 
  2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
  /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
  4096 bytes, directio = 1, aio = 1
  2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
  /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
  4096 bytes, directio = 1, aio = 1
  2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
  2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
  sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
  a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
  2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
  creating/touching snapmapper object
 
  The log statement is inaccurate though, since it is actually doing init
 operation for the 'infos' object (as can be observed from source [2]).
 
  Upon debugging further, the thread seems to be waiting to acquire the
 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
 
  (gdb) where
  #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
  /lib/x86_64-linux-gnu/libpthread.so.0
  #1  0x7fd313132bf4 in
  ObjectStore::apply_transactions(ObjectStore::Sequencer*,
  std::list<ObjectStore::Transaction*,
  std::allocator<ObjectStore::Transaction*> >&, Context*) ()
  #2  0x7fd313097d08 in
  ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
  #3  0x7fd313076790 in OSD::init() ()
  #4  0x7fd3130233a7 in main ()
 
  In a few cases, upon restarting the stuck OSD (service), it successfully
 completes the 'init' phase and reaches the 'up' and 'in' state!
 
  Any help is greatly appreciated. Please let me know if any more details are
 required for root causing.
 
  [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
  [2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
 
  Regards,
  Unmesh G.
  IRC: unmeshg
 
 
 
 --
 Best Regards,
 
 Wheat

Re: OSD sometimes stuck in init phase

2015-08-06 Thread Haomai Wang
Could you print all your thread call stacks via 'thread apply all bt'?

On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote:
 Hi,

 On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate 
 data and journal disks (using the ceph-disk utility). It is observed, that 
 few OSDs start-up fine (are 'up' and 'in' state); however, others are stuck 
 in the 'init creating/touching snapmapper object' phase. Below is a OSD 
 start-up log snippet:

 2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open 
 /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 
 bytes, directio = 1, aio = 1
 2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open 
 /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 
 bytes, directio = 1, aio = 1
 2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
 2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock 
 sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 
 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
 2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching 
 snapmapper object

 The log statement is inaccurate though, since it is actually doing init 
 operation for the 'infos' object (as can be observed from source [2]).

 Upon debugging further, the thread seems to be waiting to acquire the 
 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

 (gdb) where
 #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib/x86_64-linux-gnu/libpthread.so.0
 #1  0x7fd313132bf4 in 
 ObjectStore::apply_transactions(ObjectStore::Sequencer*, 
 std::list<ObjectStore::Transaction*, 
 std::allocator<ObjectStore::Transaction*> >&, Context*) ()
 #2  0x7fd313097d08 in 
 ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
 #3  0x7fd313076790 in OSD::init() ()
 #4  0x7fd3130233a7 in main ()

 In a few cases, upon restarting the stuck OSD (service), it successfully 
 completes the 'init' phase and reaches the 'up' and 'in' state!

 Any help is greatly appreciated. Please let me know if any more details are 
 required for root causing.

 [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 [2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211

 Regards,
 Unmesh G.
 IRC: unmeshg



-- 
Best Regards,

Wheat


RE: OSD sometimes stuck in init phase

2015-08-06 Thread Gurjar, Unmesh
Please find ceph.conf at [1] and the corresponding OSD log at [2].

To clarify one thing I skipped earlier: while bringing up the OSDs, 
'ceph-disk activate' was hanging (due to issue [3]). To get around this, I 
had to temporarily disable 'journal dio' to get the disk activated (with 
'mark-init' set to none) and then explicitly start the OSD service after 
updating the conf to re-enable 'journal dio'. I am hopeful that this should not 
cause the present issue (since a few OSDs start successfully on the first 
attempt and others on subsequent service restarts)!

[1] - http://paste.openstack.org/show/411161/
[2] - http://paste.openstack.org/show/411162/
[3] - http://tracker.ceph.com/issues/9768

Regards,
Unmesh G.
IRC: unmeshg

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 6:22 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase
 
  I don't see anything strange.
 
 Could you paste your ceph.conf? And restart this osd with debug_osd=20/20,
 debug_filestore=20/20 :-)
 
 On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com
 wrote:
  Thanks for quick response Haomai! Please find the backtrace here [1].
 
  [1] - http://paste.openstack.org/show/411139/
 
  Regards,
  Unmesh G.
  IRC: unmeshg
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Thursday, August 06, 2015 5:31 PM
  To: Gurjar, Unmesh
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: OSD sometimes stuck in init phase
 
   Could you print all your thread call stacks via 'thread apply all bt'?
 
  On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com
  wrote:
   Hi,
  
   On a Ceph Firefly cluster (version [1]), OSDs are configured to use
   separate
  data and journal disks (using the ceph-disk utility). It is observed,
  that few OSDs start-up fine (are 'up' and 'in' state); however,
  others are stuck in the 'init creating/touching snapmapper object'
  phase. Below is a OSD start-up log
  snippet:
  
   2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
   /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block
   size
   4096 bytes, directio = 1, aio = 1
   2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
   /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block
   size
   4096 bytes, directio = 1, aio = 1
   2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
   2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
   sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
   a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
   2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
   creating/touching snapmapper object
  
   The log statement is inaccurate though, since it is actually doing
   init
  operation for the 'infos' object (as can be observed from source [2]).
  
   Upon debugging further, the thread seems to be waiting to acquire
   the
  'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
  
   (gdb) where
   #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
   /lib/x86_64-linux-gnu/libpthread.so.0
   #1  0x7fd313132bf4 in
   ObjectStore::apply_transactions(ObjectStore::Sequencer*,
    std::list<ObjectStore::Transaction*,
    std::allocator<ObjectStore::Transaction*> >&, Context*) ()
    #2  0x7fd313097d08 in
    ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*)
    ()
   #3  0x7fd313076790 in OSD::init() ()
   #4  0x7fd3130233a7 in main ()
  
   In a few cases, upon restarting the stuck OSD (service), it
   successfully
  completes the 'init' phase and reaches the 'up' and 'in' state!
  
   Any help is greatly appreciated. Please let me know if any more
   details are
  required for root causing.
  
   [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
   [2] -
   https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
  
   Regards,
   Unmesh G.
   IRC: unmeshg
 
 
 
  --
  Best Regards,
 
  Wheat
 
 
 
 --
 Best Regards,
 
 Wheat


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Christoph Hellwig
On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
 Today I learned that syncfs(2) does an O(n) search of the superblock's 
 inode list searching for dirty items.  I've always assumed that it was 
 only traversing dirty inodes (e.g., a list of dirty inodes), but that 
 appears not to be the case, even on the latest kernels.

I'm pretty sure Dave had some patches for that.  Even if they aren't
included it's not an unsolved problem.

 The main thing to watch out for is that according to POSIX you really need 
 to fsync directories.  With XFS that isn't the case since all metadata 
 operations are going into the journal and that's fully ordered, but we 
 don't want to allow data loss on e.g. ext4 (we need to check what the 
 metadata ordering behavior is there) or other file systems.

That additional fsync in XFS is basically free, so better get it right
and let the file system micro optimize for you.



Re: FileStore should not use syncfs(2)

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, Haomai Wang wrote:
 Agree
 
 On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote:
  Thanks Sage for digging down..I was suspecting something similar.. As I 
  mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 
  64 GB of RAM in the system.
  The workaround I was talking about today  is working pretty good so far. In 
  this implementation, I am not giving much work to syncfs as each worker 
  thread is writing with o_dsync mode. I am issuing syncfs before trimming 
  the journal and most of the time I saw it is taking < 100 ms.
 
 Actually I would prefer we don't use syncfs anymore. I would rather use
 aio+dio plus a FileStore custom cache to deal with all the syncfs+pagecache
 things. Then we can even make the cache smarter and aware of the upper
 levels instead of relying on fadvise* calls. Second, we can use a checkpoint
 method like MySQL InnoDB: we can know the bandwidth of the frontend
 (FileJournal) and decide how much and how often we want to flush (using aio+dio).
 
 Anyway, because it's a big project, we may prefer to work at newstore
 instead of filestore.
 
  I have to wake up the sync_thread now after each worker thread finishes 
  writing. I will benchmark both approaches. As we discussed earlier, in 
  the case of the fsync-only approach, we still need to do a db sync to make 
  sure the leveldb stuff is persisted, right?
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Sage Weil [mailto:sw...@redhat.com]
  Sent: Wednesday, August 05, 2015 2:27 PM
  To: Somnath Roy
  Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
  Subject: FileStore should not use syncfs(2)
 
  Today I learned that syncfs(2) does an O(n) search of the superblock's 
  inode list searching for dirty items.  I've always assumed that it was only 
  traversing dirty inodes (e.g., a list of dirty inodes), but that appears 
  not to be the case, even on the latest kernels.
 
  That means that the more RAM in the box, the larger (generally) the inode 
  cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
  it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
  servicing a very light workload, and each syncfs(2) call was taking ~7 
  seconds (usually to write out a single inode).
 
  A possible workaround for such boxes is to turn 
  /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
  pages instead of inodes/dentries)...
 
  I think the take-away though is that we do need to bite the bullet and make 
  FileStore f[data]sync all the right things so that the syncfs call can be 
  avoided.  This is the path you were originally headed down, Somnath, and I 
  think it's the right one.
 
  The main thing to watch out for is that according to POSIX you really need 
  to fsync directories.  With XFS that isn't the case since all metadata 
  operations are going into the journal and that's fully ordered, but we 
  don't want to allow data loss on e.g. ext4 (we need to check what the 
  metadata ordering behavior is there) or other file systems.
 
 I guess there are only a few directory-modifying operations, is that true?
 Maybe we only need to do syncfs when modifying directories?

I'd say there are a few broad cases:

 - creating or deleting objects.  simply fsyncing the file is 
sufficient on XFS; we should confirm what the behavior is on other 
distros.  But even if we do the fsync on the dir this is simple to 
implement.

 - renaming objects (collection_move_rename).  Easy to add an fsync here.

 - HashIndex rehashing.  This is where I get nervous... and setting some 
flag that triggers a full syncfs might be an interim solution since it's a 
pretty rare event.  OTOH, adding the fsync calls in the HashIndex code 
probably isn't so bad to audit and get right either...
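As an illustration of the rename case, a minimal sketch with plain POSIX calls
(not the FileStore code; the helper below is hypothetical) of making a rename
durable without syncfs:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <initializer_list>

int durable_rename(const char* from, const char* to,
                   const char* from_dir, const char* to_dir) {
  if (rename(from, to) != 0)
    return -1;
  // On XFS the journal already orders this; on other filesystems the
  // directory fsyncs are what actually pin the name change to disk.
  for (const char* d : {to_dir, from_dir}) {
    int fd = open(d, O_RDONLY | O_DIRECTORY);
    if (fd < 0) return -1;
    int r = fsync(fd);
    close(fd);
    if (r != 0) return -1;
  }
  return 0;
}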

sage


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, Christoph Hellwig wrote:
 On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
  Today I learned that syncfs(2) does an O(n) search of the superblock's 
  inode list searching for dirty items.  I've always assumed that it was 
  only traversing dirty inodes (e.g., a list of dirty inodes), but that 
  appears not to be the case, even on the latest kernels.
 
 I'm pretty sure Dave had some patches for that,  Even if they aren't
 included it's not an unsolved problem.
 
  The main thing to watch out for is that according to POSIX you really need 
  to fsync directories.  With XFS that isn't the case since all metadata 
  operations are going into the journal and that's fully ordered, but we 
  don't want to allow data loss on e.g. ext4 (we need to check what the 
  metadata ordering behavior is there) or other file systems.
 
 That additional fsync in XFS is basically free, so better get it right
 and let the file system micro optimize for you.

I'm guessing the strategy here should be to fsync the file (leaf) and then 
any affected ancestors, such that the directory fsyncs are effectively 
no-ops?  Or does it matter?
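For illustration, a rough sketch of that ordering with plain POSIX calls (not
FileStore code; the helper below is hypothetical): fsync the leaf first, then
the parent directory.

#include <fcntl.h>
#include <unistd.h>

int persist_new_file(const char* path, const char* parent_dir,
                     const void* buf, size_t len) {
  int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
  if (fd < 0) return -1;
  if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) { close(fd); return -1; }
  close(fd);

  // Then pin the new directory entry itself; often a no-op on XFS.
  int dfd = open(parent_dir, O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return -1;
  int r = fsync(dfd);
  close(dfd);
  return r;
}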

Thanks!
sage



OSD sometimes stuck in init phase

2015-08-06 Thread Gurjar, Unmesh
Hi,

On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate 
data and journal disks (using the ceph-disk utility). It is observed that a few 
OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck 
in the 'init creating/touching snapmapper object' phase. Below is an OSD 
start-up log snippet:

2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open 
/var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open 
/var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock 
sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 
a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching 
snapmapper object

The log statement is inaccurate though, since it is actually doing init 
operation for the 'infos' object (as can be observed from source [2]).

Upon debugging further, the thread seems to be waiting to acquire the 
'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

(gdb) where
#0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7fd313132bf4 in 
ObjectStore::apply_transactions(ObjectStore::Sequencer*, 
std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> 
>&, Context*) ()
#2  0x7fd313097d08 in 
ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
#3  0x7fd313076790 in OSD::init() ()
#4  0x7fd3130233a7 in main ()
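For context, the wait in frame #1 follows the usual pattern of handing a
transaction to the backend with an on-commit callback and blocking on a
condition variable until it fires; a simplified, self-contained sketch
(illustrative only, not the ObjectStore code):

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

int apply_transaction_blocking(std::function<void(std::function<void()>)> queue_txn) {
  std::mutex my_lock;
  std::condition_variable cond;
  bool done = false;
  // Hand the transaction to the backend with an on-commit callback.
  queue_txn([&] {
    std::lock_guard<std::mutex> g(my_lock);
    done = true;
    cond.notify_all();
  });
  // This is the wait the backtrace shows: if the commit never arrives, init hangs here.
  std::unique_lock<std::mutex> l(my_lock);
  cond.wait(l, [&] { return done; });
  return 0;
}

int main() {
  std::thread backend;
  apply_transaction_blocking([&](std::function<void()> on_commit) {
    backend = std::thread(on_commit);   // a backend that does complete
  });
  backend.join();
  std::cout << "transaction applied\n";
}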

In a few cases, upon restarting the stuck OSD (service), it successfully 
completes the 'init' phase and reaches the 'up' and 'in' state! 

Any help is greatly appreciated. Please let me know if any more details are 
required for root causing.

[1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
[2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211

Regards,
Unmesh G.
IRC: unmeshg


Re: Consult some problems of Ceph when reading source code

2015-08-06 Thread Sage Weil
Hi!

On Thu, 6 Aug 2015, ?? wrote:
 Dear developers,
 
 My name is Cai Yi, and I am a graduate student majoring in CS at Xi'an 
 Jiaotong University in China. From Ceph's homepage, I know Sage is the 
 author of Ceph, and I got the email address from your GitHub and Ceph's 
 official website. Because Ceph is an excellent distributed file system, 
 I have recently been reading the source code of Ceph (the Hammer 
 edition) to understand the good IO path and the performance of Ceph. 
 However, I face some problems for which I could not find a solution on 
 the Internet or solve by myself and my partners. So I was wondering if 
 you could help us solve some problems. The problems are as follows:
 
 1)  In Ceph there is the concept of a transaction. When the OSD 
 receives a write request, it is encapsulated in a transaction. But when 
 the OSD receives many requests, is there a transaction queue to receive 
 the messages? If there is a queue, are these transactions submitted to 
 the next stage serially or in parallel? If serially, could the 
 transaction operations hurt performance?

The requests are distributed across placement groups and into a shared 
work queue, implemented by ShardedWQ in common/WorkQueue.h.  This 
serializes processing for a given PG, but this generally makes little 
difference as there are typically 100 or more PGs per OSD.

 2)  From some documents about Ceph, if the OSD receives a read request, 
 the OSD can only read the data from the primary and then return it to 
 the client. Is that description right?

Yes.  This is usually the right thing to do or else a given object will 
end up consuming cache (memory) on more than one OSD and the overall cache 
efficiency of the cluster will drop by your replication factor.  It's only 
a win to distribute reads when you have a very hot object, or when you 
want to spend OSD resources to reduce latency (e.g., by sending reads to 
all replicas and taking the fastest reply).

 Is there any way to read the data from a replica 
 OSD? Do we have to request the data from the primary OSD when dealing 
 with a read request? If not, and we can read from a replica OSD, can we 
 still guarantee consistency?

There is a client-side flag to read from a random or the closest 
replica, but there are a few bugs that affect consistency when recovery is 
underway that are being fixed up now.  It is likely that this will work 
correctly in Infernalis, the next stable release.
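For reference, one place such a flag is exposed is the librados operation
flags; a hedged sketch using the C read_op API (the flag and function names
below should be verified against the librados.h shipped with your release):

#include <rados/librados.h>
#include <cstdio>

int balanced_read(rados_ioctx_t io, const char* oid) {
  char buf[4096];
  size_t bytes_read = 0;
  int prval = 0;
  rados_read_op_t op = rados_create_read_op();
  rados_read_op_read(op, 0, sizeof(buf), buf, &bytes_read, &prval);
  // BALANCE_READS spreads reads across replicas; LOCALIZE_READS prefers a
  // nearby replica. Both trade the caching benefit of primary-only reads
  // for lower or spread-out latency, as discussed above.
  int r = rados_read_op_operate(op, io, oid, LIBRADOS_OPERATION_BALANCE_READS);
  rados_release_read_op(op);
  if (r == 0)
    printf("read %zu bytes\n", bytes_read);
  return r;
}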

 3)  When the OSD receives the message, the message's attribute may be 
 normal dispatch or fast dispatch. What is the difference between 
 normal dispatch and fast dispatch? If the attribute is normal 
 dispatch, it enters the dispatch queue. Is there a single dispatch 
 queue or are there multiple dispatch queues to deal with all the messages?

There is a single thread that does the normal dispatch.  Fast dispatch 
processes the message synchronously from the thread that received the 
message, so it is faster, but it has to be careful not to block.
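A schematic sketch of the difference (illustrative only, not the Messenger
code): normal dispatch goes through a single queue drained by one dispatch
thread, while fast dispatch runs the handler directly in the receiving thread.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Message { std::string payload; bool fast; };

class DispatcherSketch {
  std::mutex lock;
  std::condition_variable cond;
  std::queue<Message> dispatch_queue;     // single queue, single dispatch thread
  bool stopping = false;
  std::thread dispatch_thread;

  void handle(const Message& m) { std::cout << "handled: " << m.payload << "\n"; }

public:
  DispatcherSketch() : dispatch_thread([this] {
    std::unique_lock<std::mutex> l(lock);
    while (!stopping || !dispatch_queue.empty()) {
      if (dispatch_queue.empty()) { cond.wait(l); continue; }
      Message m = dispatch_queue.front(); dispatch_queue.pop();
      l.unlock();
      handle(m);                          // normal dispatch: serialized here
      l.lock();
    }
  }) {}

  void deliver(const Message& m) {        // called from the receiving thread
    if (m.fast) {
      handle(m);                          // fast dispatch: synchronous, must not block
      return;
    }
    std::lock_guard<std::mutex> g(lock);
    dispatch_queue.push(m);
    cond.notify_one();
  }

  ~DispatcherSketch() {
    { std::lock_guard<std::mutex> g(lock); stopping = true; cond.notify_all(); }
    dispatch_thread.join();
  }
};

int main() {
  DispatcherSketch d;
  d.deliver({"osd_op (fast)", true});
  d.deliver({"mon_map (normal)", false});
}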

 These are the problems I am facing. Thank you for your patience and 
 cooperation, and I look forward to hearing from you.

Hope that helps!
sage


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, Yan, Zheng wrote:
 On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote:
  Today I learned that syncfs(2) does an O(n) search of the superblock's
  inode list searching for dirty items.  I've always assumed that it was
  only traversing dirty inodes (e.g., a list of dirty inodes), but that
  appears not to be the case, even on the latest kernels.
 
 
 I checked syncfs code in 3.10/4.1 kernel. I think both kernels only
 traverse dirty inodes (inodes in
 bdi_writeback::{b_dirty,b_io,b_more_io} lists). what am I missing?

See wait_sb_inodes in fs/fs-writeback.c, called by sync_inodes_sb.

sage


About the Ceph erasure pool with ISA plugin on Intel xeon CPU

2015-08-06 Thread Derek Su
Dear Mr. Dachary and all,

Recently, I found your blog showing the performance tests of erasure
pools (http://dachary.org/?p=3042 , http://dachary.org/?p=3665).
The results indicate that the write throughput can be enhanced
significantly using Intel Xeon CPUs.

I tried to create an erasure pool with isa plugin, reed_sol_van
technique, and k/m=4/2 on the Intel(R) Xeon(R) CPU E3-1245 v3 @
3.40GHz machines.

However, the results of the rados benchmark showed that there was no
difference between the jerasure and isa plugins. That seems very
strange.

Do I need to do any other configuration in addition to setting the
erasure profile?
In addition, how can I know whether the erasure pool is actually
accelerated by the ISA plugin? Is there any command I can use?

Thanks, :)

Derek Su.


Re: Newbie question about metadata_list.

2015-08-06 Thread Ilya Dryomov
On Thu, Aug 6, 2015 at 12:26 PM, Łukasz Szymczyk
lukasz.szymc...@corp.ovh.com wrote:
 Hi,

 I'm writing a program to replace an image in the cluster with its copy.
 But I have a problem with metadata_list.
 I created pool:
 #rados mkpool dupa
 then I created image:
 #rbd create --size 1000 -p mypool image --image-format 2

 Below is the code which tries to get the metadata, but it fails with -EOPNOTSUPP.
 I compile it with a command like this:
 g++ file.cpp -lrbd -lrados -I/ceph/source/directory/src

 #include <stdio.h>
 #include <stdlib.h>
 #include "include/rbd/librbd.h"
 #include "include/rbd/librbd.hpp"
 using namespace ceph;
 #include "librbd/ImageCtx.h"

 int main() {
 rados_t clu;
 int ret = rados_create(&clu, NULL);
 if (ret) return -1;

 ret = rados_conf_read_file(clu, NULL);
 if (ret) return -1;

 rados_conf_parse_env(clu, NULL);
 ret = rados_connect(clu);
 if (ret) return -1;

 rados_ioctx_t io;
 ret = rados_ioctx_create(clu, "mypool", &io);
 if (ret) return -1;

 rbd_image_t im;
 ret = rbd_open(io, "image", &im, NULL);
 if (ret) return -1;

 librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
 std::string start;
 int max = 1000;
 bufferlist in, out;
 ::encode(start, in);
 ::encode(max, in);
 ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", 
 "metadata_list", in, out);
 if (ret < 0) printf("fail\n");

You should use rbd_metadata_list() C API instead of this.


 return 0;

 }

 So, my question is: what should be set/enabled to get that metadata?
 Or maybe tell me what I'm doing wrong here.

Try "rbd image-meta list image"?

It's a fairly recent feature, are you sure your OSDs support it?
What's the output of "ceph daemon osd.0 version"?
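For reference, a hedged sketch of the rbd_metadata_list() route (the exact
prototype and buffer handling below are assumptions and should be checked
against the librbd.h shipped with your release):

#include <rados/librados.h>
#include <rbd/librbd.h>
#include <cstdio>
#include <cstring>

int list_image_metadata(rbd_image_t image) {
  char keys[4096], vals[4096];
  size_t keys_len = sizeof(keys), vals_len = sizeof(vals);
  // Assumed prototype: (image, start_key, max, keys, &keys_len, vals, &vals_len);
  // keys/vals are assumed to come back as NUL-separated strings.
  int r = rbd_metadata_list(image, "", 1000, keys, &keys_len, vals, &vals_len);
  if (r < 0) {
    fprintf(stderr, "rbd_metadata_list failed: %d\n", r);
    return r;
  }
  for (size_t k = 0, v = 0; k < keys_len && keys[k]; ) {
    printf("%s = %s\n", &keys[k], &vals[v]);
    k += strlen(&keys[k]) + 1;
    v += strlen(&vals[v]) + 1;
  }
  return 0;
}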

Thanks,

Ilya


Re: OSD sometimes stuck in init phase

2015-08-06 Thread Haomai Wang
I don't see anything strange.

Could you paste your ceph.conf? And restart this osd with
debug_osd=20/20, debug_filestore=20/20 :-)

On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote:
 Thanks for quick response Haomai! Please find the backtrace here [1].

 [1] - http://paste.openstack.org/show/411139/

 Regards,
 Unmesh G.
 IRC: unmeshg

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 5:31 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase

 Could you print all your thread call stacks via 'thread apply all bt'?

 On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com
 wrote:
  Hi,
 
  On a Ceph Firefly cluster (version [1]), OSDs are configured to use 
  separate
 data and journal disks (using the ceph-disk utility). It is observed, that 
 few OSDs
 start-up fine (are 'up' and 'in' state); however, others are stuck in the 
 'init
 creating/touching snapmapper object' phase. Below is a OSD start-up log
 snippet:
 
  2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
  /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
  4096 bytes, directio = 1, aio = 1
  2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
  /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
  4096 bytes, directio = 1, aio = 1
  2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
  2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
  sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
  a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
  2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
  creating/touching snapmapper object
 
  The log statement is inaccurate though, since it is actually doing init
 operation for the 'infos' object (as can be observed from source [2]).
 
  Upon debugging further, the thread seems to be waiting to acquire the
 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
 
  (gdb) where
  #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
  /lib/x86_64-linux-gnu/libpthread.so.0
  #1  0x7fd313132bf4 in
  ObjectStore::apply_transactions(ObjectStore::Sequencer*,
 std::list<ObjectStore::Transaction*,
 std::allocator<ObjectStore::Transaction*> >&, Context*) ()
  #2  0x7fd313097d08 in
 ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
  #3  0x7fd313076790 in OSD::init() ()
  #4  0x7fd3130233a7 in main ()
 
  In a few cases, upon restarting the stuck OSD (service), it successfully
 completes the 'init' phase and reaches the 'up' and 'in' state!
 
  Any help is greatly appreciated. Please let me know if any more details are
 required for root causing.
 
  [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
  [2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
 
  Regards,
  Unmesh G.
  IRC: unmeshg



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Christoph Hellwig
On Thu, Aug 06, 2015 at 06:00:42AM -0700, Sage Weil wrote:
 I'm guessing the strategy here should be to fsync the file (leaf) and then 
 any affected ancestors, such that the directory fsyncs are effectively 
 no-ops?  Or does it matter?

All metadata transactions log the involved parties (parent and child
inode(s) mostly) in the same transaction, so flushing one of them out
is enough.  But file data I/O might dirty the inode before it is flushed
out, so to avoid writing the inode log item out twice you first want
to fsync any file that had data I/O, followed by the directories or special
files that only had metadata modified.
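
A minimal sketch of that ordering (illustrative only, not FileStore code; the
fd bookkeeping is assumed to exist elsewhere):

  #include <errno.h>
  #include <unistd.h>
  #include <vector>

  // fsync files that had data I/O first, then the directories (or special
  // files) that only had metadata modified, so each inode log item only
  // needs to be written out once.
  int flush_ordered(const std::vector<int> &data_fds,
                    const std::vector<int> &dir_fds)
  {
    for (int fd : data_fds)
      if (fsync(fd) < 0)
        return -errno;
    for (int fd : dir_fds)
      if (fsync(fd) < 0)
        return -errno;
    return 0;
  }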


Re: OSD sometimes stuck in init phase

2015-08-06 Thread Haomai Wang
It seems the filestore isn't applying the transaction as expected. Sorry, you
need to add debug_journal=20/20 as well to help find the reason. :-)

BTW, what's your OS version? How many OSDs do you have in this
cluster, and how many OSDs failed to start like this?

On Thu, Aug 6, 2015 at 9:17 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote:
 Please find ceph.conf at [1] and the corresponding OSD log at [2].

 To clarify one thing I skipped earlier: while bringing up the OSDs,
 'ceph-disk activate' was hanging (due to issue [3]). To get past this, I
 temporarily disabled 'journal dio' to get the disk activated (with
 'mark-init' set to none) and then explicitly started the OSD service after
 updating the conf to re-enable 'journal dio'. I am hopeful that this does not
 cause the present issue (since a few OSDs start successfully on the first
 attempt and others on subsequent service restarts)!

 [1] - http://paste.openstack.org/show/411161/
 [2] - http://paste.openstack.org/show/411162/
 [3] - http://tracker.ceph.com/issues/9768

 Regards,
 Unmesh G.
 IRC: unmeshg

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 6:22 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase

 I don't see anything strange.

 Could you paste your ceph.conf and restart this OSD with debug_osd=20/20,
 debug_filestore=20/20? :-)

 On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com
 wrote:
  Thanks for the quick response, Haomai! Please find the backtrace here [1].
 
  [1] - http://paste.openstack.org/show/411139/
 
  Regards,
  Unmesh G.
  IRC: unmeshg
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Thursday, August 06, 2015 5:31 PM
  To: Gurjar, Unmesh
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: OSD sometimes stuck in init phase
 
  Could you print the backtraces of all threads via "thread apply all bt"?
 
  On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com
  wrote:
   Hi,
  
   On a Ceph Firefly cluster (version [1]), OSDs are configured to use
   separate
  data and journal disks (using the ceph-disk utility). It is observed
  that a few OSDs start up fine (reach the 'up' and 'in' state); however,
  others are stuck in the 'init creating/touching snapmapper object'
  phase. Below is an OSD start-up log snippet:
  
   2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
   /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block
   size
   4096 bytes, directio = 1, aio = 1
   2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
   /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block
   size
   4096 bytes, directio = 1, aio = 1
   2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
   2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
   sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
   a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
   2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
   creating/touching snapmapper object
  
   The log statement is misleading, though, since it is actually performing
   the init operation for the 'infos' object (as can be seen in the source [2]).
  
   Upon debugging further, the thread seems to be waiting to acquire
   the
  'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
  
   (gdb) where
   #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
   /lib/x86_64-linux-gnu/libpthread.so.0
   #1  0x7fd313132bf4 in
   ObjectStore::apply_transactions(ObjectStore::Sequencer*,
    std::list<ObjectStore::Transaction*,
    std::allocator<ObjectStore::Transaction*> >&, Context*) ()
   #2  0x7fd313097d08 in
    ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*)
   ()
   #3  0x7fd313076790 in OSD::init() ()
   #4  0x7fd3130233a7 in main ()
  
   In a few cases, upon restarting the stuck OSD (service), it
   successfully
  completes the 'init' phase and reaches the 'up' and 'in' state!
  
   Any help is greatly appreciated. Please let me know if any more
   details are
  required for root causing.
  
   [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
   [2] -
   https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
  
   Regards,
   Unmesh G.
   IRC: unmeshg
 
 
 
  --
  Best Regards,
 
  Wheat



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat


Re: Erasure Code Plugins : PLUGINS_V3 feature

2015-08-06 Thread Loic Dachary
Hi Takeshi,

https://github.com/ceph/ceph/pull/5493 is ready for your review. The matching 
integration tests can be found at https://github.com/ceph/ceph-qa-suite/pull/523

Cheers

On 06/08/2015 02:28, Miyamae, Takeshi wrote:
 Dear Sage,
 
 note that what this really means is that the on-disk encoding needs to 
 remain fixed.
 
 Thank you for letting us know the important notice.
 We have no plan to change shec's format at this moment, but we will keep
 the comment in mind for the future.
 
 Best Regards,
 Takeshi Miyamae
 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com] 
 Sent: Thursday, August 6, 2015 3:45 AM
 To: Loic Dachary; Miyamae, Takeshi/宮前 剛
 Cc: Samuel Just; Ceph Development
 Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature
 
 On Wed, 5 Aug 2015, Loic Dachary wrote:
 Hi Sam,

 How does this proposal sound ? It would be great if that was done 
 before the feature freeze.
 
 I think it's a good time.
 
 Takeshi, note that what this really means is that the on-disk encoding needs 
 to remain fixed.  If we decide to change it down the line, we'll have to make 
 a 'shec2' or similar so that the old format is still decodable (or ensure 
 that existing data can still be read in some other way).
 
 Sound good?
 
 sage
 
 

 Cheers

 On 29/07/2015 11:16, Loic Dachary wrote:
 Hi Sam,

 The SHEC plugin[0] has been running in the rados runs[1] for the past few
 months. It, as well as its optimized variants, also has a matching corpus
 verification which runs on every make check[2]. I believe the 'experimental'
 flag can now be removed.

 In order to do so, we need to use a PLUGINS_V3 feature, in the same way we 
 did back in Giant when the ISA and LRC plugins were introduced[3]. This 
 won't be necessary in the future, when there is a generic plugin mechanism, 
 but right now that's what we need. It would be a commit very similar to the 
 one implementing PLUGINS_V2[4].

 Is this agreeable to you ? Or would you rather see another way to resolve 
 this ?

 Cheers

 [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec
 [1] 
 https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thras
 h-erasure-code-shec [2] 
 https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-9
 88/non-regression.sh#L52 [3] http://tracker.ceph.com/issues/9343
 [4] 
 https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d420
 7a6d7489


 --
 Loïc Dachary, Artisan Logiciel Libre



-- 
Loïc Dachary, Artisan Logiciel Libre





testing the teuthology OpenStack backend

2015-08-06 Thread Loic Dachary
Hi,

I'm looking into testing the OpenStack backend for teuthology on a new cluster
to verify it's portable. I think it is, but ... ;-) I'm told you have an
OpenStack cluster and would be interested in running teuthology workloads on
it. Does it have a public-facing API?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re:Re: Consult some problems of Ceph when reading source code

2015-08-06 Thread 蔡毅
Dear Dr.Sage:
Thank you for your detailed reply! These answers help me a lot. I also have
some questions about Question (1).
In your reply, requests are enqueued into the ShardedWQ according to their PG.
If I have 3 requests (that is, (pg1,r1), (pg2,r2), (pg3,r3)) and I put them
into the ShardedWQ, is the processing also serialized?
When I want to dequeue an item from the ShardedWQ, there is a work_queues
member (a vector of work queues) in ThreadPool (WorkQueue.cc), and the work
queue is then selected from work_queues. So are there many work queues
involved in processing a request, or is there no association with the ShardedWQ?
When I get an item from the ShardedWQ, I will turn it into a transaction and
then read or write. Is this process done one by one (another transaction is
handled only when this transaction is over)? If it is, how can we guarantee
performance? If it isn't, are the transactions' actions parallel?
Thank you a lot!





At 2015-08-06 20:44:45, Sage Weil s...@newdream.net wrote:
Hi!

On Thu, 6 Aug 2015, ?? wrote:
 Dear developers,
 
 My name is Cai Yi, and I am a graduate student majored in CS of Xi'an 
 Jiaotong University in China. From Ceph's homepage, I know Sage is the 
 author of Ceph and I get the email address from your GitHub and Ceph's 
 official website. Because Ceph is an excellent distributed file system, 
 so recently, I am reading the source code of the Ceph (the edition is 
 Hammer) to understand the IO good path and the performance of Ceph. 
 However, I face some problems which I could not find the solution from 
 Internet or solve by myself and my partners. So I was wondering if you 
 could help us solve some problems. The problems are as follows:
 
 1)  In the Ceph, there is a concept that is the transaction. When the 
 OSD receives a write request, and then it is encapsulated by a 
 transaction. But When the OSD receive many requests, is there a 
 transaction queue to receive the messages? If there is a queue, is it a 
 process of serial or parallel to submit these transaction to do next 
 operation? If it is serial, could the transaction operations influence 
 the performance?

The requests are distributed across placement groups and into a shared 
work queue, implemented by ShardedWQ in common/WorkQueue.h.  This 
serializes processing for a given PG, but this generally makes little 
difference as there are typically 100 or more PGs per OSD.
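
A toy illustration of the idea (made-up names, not the actual OSD.cc code): a
stable hash of the PG id picks the shard, so every request for a given PG
lands in the same shard and is dequeued in submission order.

  uint32_t pick_shard(uint64_t pg_hash, uint32_t num_shards) {
    // same pg_hash -> same shard, so one shard's threads see that PG's
    // requests in the order they were queued
    return pg_hash % num_shards;
  }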

 2)  From some documents about Ceph, if the OSD receives a read request, 
 the OSD can only read data from the primary and then return it to the client. Is the 
 description right?

Yes.  This is usually the right thing to do or else a given object will 
end up consuming cache (memory) on more than one OSD and the overall cache 
efficiency of the cluster will drop by your replication factor.  It's only 
a win to distribute reads when you have a very hot object, or when you 
want to spend OSD resources to reduce latency (e.g., by sending reads to 
all replicas and taking the fastest reply).

 Is there any way to read the data from a replica 
 OSD? Do we have to get the data from the primary OSD when dealing with 
 the read request? If not, and we can read from a replica OSD, could we 
 still guarantee consistency?

There is a client-side flag to read from a random or the closest 
replica, but there are a few bugs that affect consistency when recovery is 
underway that are being fixed up now.  It is likely that this will work 
correctly in Infernalis, the next stable release.
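
For the curious, the knob is exposed through librados roughly like this (a
sketch; the flag name and calls should be verified against
include/rados/librados.h, and io is assumed to be an open rados_ioctx_t):

  char buf[4096];
  size_t bytes_read = 0;
  int rval = 0;
  rados_read_op_t rop = rados_create_read_op();
  rados_read_op_read(rop, 0, sizeof(buf), buf, &bytes_read, &rval);
  /* LIBRADOS_OPERATION_BALANCE_READS spreads reads across replicas;
   * LIBRADOS_OPERATION_LOCALIZE_READS prefers the "closest" one */
  int r = rados_read_op_operate(rop, io, "some-object",
                                LIBRADOS_OPERATION_BALANCE_READS);
  rados_release_read_op(rop);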

 3)  When the OSD receives the message, the message's attribute may be 
 the normal dispatch or the fast dispatch. What is the difference between 
 the normal dispatch and the fast dispatch? If the attribute is the 
 normal dispatch, it enters the dispatch queue. Is there a single 
 dispatch queue or multi dispatch queue to deal with all the messages?

There is a single thread that does the normal dispatch.  Fast dispatch 
processes the message synchronously from the thread that received the 
message, so it is faster, but it has to be careful not to block.

 These are the problems I am facing. Thank you for your patience and 
 cooperation, and I look forward to hearing from you.

Hope that helps!
sage

Re: radosgw + civetweb latency issue on Hammer

2015-08-06 Thread Mark Nelson

Hi Srikanth,

Can you make a ticket on tracker.ceph.com for this?  We'd like to not 
lose track of it.


Thanks!
Mark

On 08/05/2015 07:01 PM, Srikanth Madugundi wrote:

Hi,

After upgrading to Hammer and moving from Apache to civetweb, we
started seeing high PUT latency on the order of 2 sec for every PUT
request. The GET request lo

Attaching the radosgw logs for a single request. The ceph.conf has the
following configuration for civetweb.

[client.radosgw.gateway]
rgw frontends = civetweb port=5632


Further investigation revealed that the call to get_data() at
https://github.com/ceph/ceph/blob/hammer/src/rgw/rgw_op.cc#L1786 is
taking 2 sec to respond. The cluster is running the Hammer 0.94.2 release.

Did anyone face this issue before? Is there some configuration I am missing?

Regards
Srikanth

