** Description changed:

  [Impact]
  Ceph from ubuntu trusty, the version is 0.80.*.
  The bug is that when a message seq number has exceeds the max value of 
unsigned 32 bit which is 4294967295, the unsigned 64 bit seq number will be 
truncated to unsigned 32 bit. But the seq number is supposed to increase one by 
one, and if messenger got a message which has an old seq number, it will just 
drop it.
  
  when it dropped one sub_op/sub_op reply message, then this operation will not 
be able to succeed, it will become a slow request.
  cluster will suffering from slow requests when it hits this bug.
  when it drops other message, might lead to other bad situation ( like droping 
osd_ping message might cause the osd down).
  
  [Test case]
  Deploy a trusty ceph cluster
  Two ways to reproduce, A will take at least 2 months to reproduce, B will be 
much easier:
  A.
  1. ceph daemon osd.<id> set config debug_ms 10
  2. wait for the seq number to increasing to max value of unsigned 32 bit 
value which is 4294967295, we can get the current seq number by doing
  command "grep "reader got message" /var/log/ceph/ceph-osd/ceph-osd.id.log"
  3. you will finally see the messenger will drop message and see below log
  "reader got old message 0 <= 4294967295 0x7fa6854a3e00 
osd_sub_op_reply(client.125216.1:7850657018 3.dc 
e24410dc/rb.0.103e.238e1f29.000000000027/head//3 [] ondisk, result = 0) v2, 
discarding "
  
  B.
  1. ceph daemon osd.<id> set config debug_ms 10
  2. gdb attach <pid of ceph-osd>
  3. set a breakpoint at function Pipe::writer and set a command "set 
out_seq=4294967295" to this breakpoint
  4. continue the process, when it hit the breakpoint multiple times (osd 
number - 1), we can exit the gdb.
- 5. grep other osds log and see if you find message dropping message like :
+ 5. grep other osds log and see if you can find message dropping log like :
  "reader got old message 0 <= 4294967295 0x7fa6854a3e00 
osd_sub_op_reply(client.125216.1:7850657018 3.dc 
e24410dc/rb.0.103e.238e1f29.000000000027/head//3 [] ondisk, result = 0) v2, 
discarding "
  
  [Regression Potential]
  None
  
  [Original Bug Report]
  When in_seq is 0xffffffff, Pipe::reader expects the incoming message's 
sequence to be 0x100000000. But m->get_seq() return 0. So the incoming message 
gets dropped.
  https://tracker.ceph.com/issues/16122

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1798081

Title:
  ceph got slow request because of primary osd drop messages

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1798081/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to