Hi all,

We recently started running into a problem where our Solr slave servers
freeze up.  After looking into the logs and the timing of the occurrences,
it seems that the freeze always follows the first replication after an
optimization.  Once a server freezes up, we are unable to ssh into it, but
it still responds to ping.  The only way to recover is to reboot the
machine.

In our replication setup, the masters are optimized nightly because we have
a fairly large index (~60GB per master) and are adding millions of documents
every day.  After the optimization, a snapshot is taken automatically.  When
replication kicks in, the corresponding slave server retrieves the
snapshot using rsync.

Here is the snappuller.log capturing one failed pull, along with the
successful pulls before and after it:

2009/05/21 22:55:01 started by biz360
2009/05/21 22:55:01 command: /mnt/solr/bin/snappuller ...
2009/05/21 22:55:04 pulling snapshot snapshot.20090521221402
2009/05/21 22:55:11 ended (elapsed time: 10 sec)

##### optimization completes sometime during this gap, and a new snapshot is
created

2009/05/21 23:55:01 started by biz360
2009/05/21 23:55:01 command: /mnt/solr/bin/snappuller ...
2009/05/21 23:55:02 pulling snapshot snapshot.20090521233922

##### slave freezes up, and machine has to be rebooted

2009/05/22 01:55:02 started by biz360
2009/05/22 01:55:02 command: /mnt/solr/bin/snappuller ...
2009/05/22 01:55:03 pulling snapshot snapshot.20090522014528
2009/05/22 02:56:12 ended (elapsed time: 3670 sec)
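
In case it helps anyone check their own slaves, here is a minimal Python
sketch for spotting pulls like the 23:55 one, i.e. a "started by" entry that
never gets a matching "ended" line.  It assumes the line layout shown in the
excerpt above (timestamp followed by a message); the function names
parse_pulls and unfinished are made up for illustration and are not part of
the Solr scripts.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above: "YYYY/MM/DD HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
ELAPSED_RE = re.compile(r"elapsed time: (\d+) sec")


def parse_pulls(lines):
    """Group snappuller.log lines into one dict per pull (hypothetical helper)."""
    pulls = []
    current = None
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # blank lines and hand-written annotations are skipped
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("started by"):
            # Every "started by" line opens a new pull record.
            current = {"started": stamp, "snapshot": None,
                       "ended": False, "elapsed_sec": None}
            pulls.append(current)
        elif current is None:
            continue  # log fragment that begins in the middle of a pull
        elif message.startswith("pulling snapshot "):
            current["snapshot"] = message[len("pulling snapshot "):]
        elif message.startswith("ended"):
            current["ended"] = True
            elapsed = ELAPSED_RE.search(message)
            if elapsed:
                current["elapsed_sec"] = int(elapsed.group(1))
    return pulls


def unfinished(pulls):
    """Return the pulls that have no 'ended' line (hypothetical helper)."""
    return [p for p in pulls if not p["ended"]]
```

Calling unfinished(parse_pulls(open(path))) on a slave's snappuller.log
(path being wherever that log lives on your machine) lists the suspect
pulls.  A pull that is still running when the log is read will also show
up, so the most recent entry needs a second look.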


A more detailed debug log shows that snappuller simply stopped after the
deletions, with _cww.fdt the last file listed:

started by biz360
command: /mnt/solr/bin/snappuller ...
pulling snapshot snapshot.20090521233922
receiving file list ... done
deleting segments_16a
deleting _cwu.tis
deleting _cwu.tii
deleting _cwu.prx
deleting _cwu.nrm
deleting _cwu.frq
deleting _cwu.fnm
deleting _cwt.tis
deleting _cwt.tii
deleting _cwt.prx
deleting _cwt.nrm
deleting _cwt.frq
deleting _cwt.fnm
deleting _cws.tis
deleting _cws.tii
deleting _cws.prx
deleting _cws.nrm
deleting _cws.frq
deleting _cws.fnm
deleting _cwr_1.del
deleting _cwr.tis
deleting _cwr.tii
deleting _cwr.prx
deleting _cwr.nrm
deleting _cwr.frq
deleting _cwr.fnm
deleting _cwq.tis
deleting _cwq.tii
deleting _cwq.prx
deleting _cwq.nrm
deleting _cwq.frq
deleting _cwq.fnm
deleting _cwq.fdx
deleting _cwq.fdt
deleting _cwp.tis
deleting _cwp.tii
deleting _cwp.prx
deleting _cwp.nrm
deleting _cwp.frq
deleting _cwp.fnm
deleting _cwp.fdx
deleting _cwp.fdt
deleting _cwo_1.del
deleting _cwo.tis
deleting _cwo.tii
deleting _cwo.prx
deleting _cwo.nrm
deleting _cwo.frq
deleting _cwo.fnm
deleting _cwo.fdx
deleting _cwo.fdt
deleting _cwe_1.del
deleting _cwe.tis
deleting _cwe.tii
deleting _cwe.prx
deleting _cwe.nrm
deleting _cwe.frq
deleting _cwe.fnm
deleting _cwe.fdx
deleting _cwe.fdt
deleting _cw2_3.del
deleting _cw2.tis
deleting _cw2.tii
deleting _cw2.prx
deleting _cw2.nrm
deleting _cw2.frq
deleting _cw2.fnm
deleting _cw2.fdx
deleting _cw2.fdt
deleting _cvs_4.del
deleting _cvs.tis
deleting _cvs.tii
deleting _cvs.prx
deleting _cvs.nrm
deleting _cvs.frq
deleting _cvs.fnm
deleting _cvs.fdx
deleting _cvs.fdt
deleting _csp_h.del
deleting _csp.tis
deleting _csp.tii
deleting _csp.prx
deleting _csp.nrm
deleting _csp.frq
deleting _csp.fnm
deleting _csp.fdx
deleting _csp.fdt
deleting _cpn_q.del
deleting _cpn.tis
deleting _cpn.tii
deleting _cpn.prx
deleting _cpn.nrm
deleting _cpn.frq
deleting _cpn.fnm
deleting _cpn.fdx
deleting _cpn.fdt
deleting _cmk_x.del
deleting _cmk.tis
deleting _cmk.tii
deleting _cmk.prx
deleting _cmk.nrm
deleting _cmk.frq
deleting _cmk.fnm
deleting _cmk.fdx
deleting _cmk.fdt
deleting _cjg_14.del
deleting _cjg.tis
deleting _cjg.tii
deleting _cjg.prx
deleting _cjg.nrm
deleting _cjg.frq
deleting _cjg.fnm
deleting _cjg.fdx
deleting _cjg.fdt
deleting _cge_19.del
deleting _cge.tis
deleting _cge.tii
deleting _cge.prx
deleting _cge.nrm
deleting _cge.frq
deleting _cge.fnm
deleting _cge.fdx
deleting _cge.fdt
deleting _cd9_1m.del
deleting _cd9.tis
deleting _cd9.tii
deleting _cd9.prx
deleting _cd9.nrm
deleting _cd9.frq
deleting _cd9.fnm
deleting _cd9.fdx
deleting _cd9.fdt
./
_cww.fdt
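
To compare a debug log like this across slaves, a rough Python sketch along
the following lines can boil it down to the number of deleted files, the
segments they belonged to, and the last entry printed before the output
stops.  It assumes the debug output looks like the excerpt above (a
"receiving file list" line, then "deleting ..." lines, then the entries
being transferred); summarize_debug_log and segment_of are hypothetical
names, not anything shipped with Solr or rsync.

```python
def segment_of(filename):
    """Map an index file such as _cwr_1.del or _cwu.tis to its segment name.

    Returns None for files that do not start with "_" (e.g. segments_16a).
    """
    if not filename.startswith("_"):
        return None
    stem = filename.split(".")[0]          # "_cwr_1.del" -> "_cwr_1"
    return "_" + stem[1:].split("_")[0]    # "_cwr_1"     -> "_cwr"


def summarize_debug_log(lines):
    """Summarize a snappuller debug log shaped like the excerpt above."""
    deleted = []
    transferred = []
    seen_file_list = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("receiving file list"):
            seen_file_list = True
        elif not seen_file_list:
            continue  # header lines: started by / command / pulling snapshot
        elif line.startswith("deleting "):
            deleted.append(line[len("deleting "):])
        else:
            transferred.append(line)  # entries printed during the transfer
    segments = sorted({s for s in map(segment_of, deleted) if s})
    return {
        "deleted_count": len(deleted),
        "deleted_segments": segments,
        "last_entry": transferred[-1] if transferred else None,
    }
```

On a log shaped like the one above, last_entry comes back as _cww.fdt, the
last name printed before the output stops.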

We have random Solr slaves failing in the exact same manner almost daily.
Any help is appreciated!
