Hi all,

We recently started running into a problem where our Solr slave servers freeze up. After looking into the logs and the timing of the occurrences, it seems that the problem always follows the first replication after an optimization. Once a server freezes up, we are unable to ssh into it, although it still responds to ping. The only way to recover is to reboot the machine.
In our replication setup, the masters are optimized nightly because we have a fairly large index (~60GB per master) and are adding millions of documents every day. After the optimization, a snapshot is taken automatically. When replication kicks in, the corresponding slave server retrieves the snapshot using rsync.

Here is the snappuller.log capturing one failed pull, along with the successful pulls before and after it:

2009/05/21 22:55:01 started by biz360
2009/05/21 22:55:01 command: /mnt/solr/bin/snappuller ...
2009/05/21 22:55:04 pulling snapshot snapshot.20090521221402
2009/05/21 22:55:11 ended (elapsed time: 10 sec)

##### optimization completes sometime during this gap, and a new snapshot is created

2009/05/21 23:55:01 started by biz360
2009/05/21 23:55:01 command: /mnt/solr/bin/snappuller ...
2009/05/21 23:55:02 pulling snapshot snapshot.20090521233922

##### slave freezes up, and machine has to be rebooted

2009/05/22 01:55:02 started by biz360
2009/05/22 01:55:02 command: /mnt/solr/bin/snappuller ...
2009/05/22 01:55:03 pulling snapshot snapshot.20090522014528
2009/05/22 02:56:12 ended (elapsed time: 3670 sec)

A more detailed debug log shows that snappuller simply stopped at some point:

started by biz360
command: /mnt/solr/bin/snappuller ...
pulling snapshot snapshot.20090521233922
receiving file list ... done
deleting segments_16a
deleting _cwu.tis
deleting _cwu.tii
deleting _cwu.prx
deleting _cwu.nrm
deleting _cwu.frq
deleting _cwu.fnm
deleting _cwt.tis
deleting _cwt.tii
deleting _cwt.prx
deleting _cwt.nrm
deleting _cwt.frq
deleting _cwt.fnm
deleting _cws.tis
deleting _cws.tii
deleting _cws.prx
deleting _cws.nrm
deleting _cws.frq
deleting _cws.fnm
deleting _cwr_1.del
deleting _cwr.tis
deleting _cwr.tii
deleting _cwr.prx
deleting _cwr.nrm
deleting _cwr.frq
deleting _cwr.fnm
deleting _cwq.tis
deleting _cwq.tii
deleting _cwq.prx
deleting _cwq.nrm
deleting _cwq.frq
deleting _cwq.fnm
deleting _cwq.fdx
deleting _cwq.fdt
deleting _cwp.tis
deleting _cwp.tii
deleting _cwp.prx
deleting _cwp.nrm
deleting _cwp.frq
deleting _cwq.fnm
deleting _cwq.fdx
deleting _cwq.fdt
deleting _cwp.tis
deleting _cwp.tii
deleting _cwp.prx
deleting _cwp.nrm
deleting _cwp.frq
deleting _cwp.fnm
deleting _cwp.fdx
deleting _cwp.fdt
deleting _cwo_1.del
deleting _cwo.tis
deleting _cwo.tii
deleting _cwo.prx
deleting _cwo.nrm
deleting _cwo.frq
deleting _cwo.fnm
deleting _cwo.fdx
deleting _cwo.fdt
deleting _cwe_1.del
deleting _cwe.tis
deleting _cwe.tii
deleting _cwe.prx
deleting _cwe.nrm
deleting _cwe.frq
deleting _cwe.fnm
deleting _cwe.fdx
deleting _cwe.fdt
deleting _cw2_3.del
deleting _cw2.tis
deleting _cw2.tii
deleting _cw2.prx
deleting _cw2.nrm
deleting _cw2.frq
deleting _cw2.fnm
deleting _cw2.fdx
deleting _cw2.fdt
deleting _cvs_4.del
deleting _cvs.tis
deleting _cvs.tii
deleting _cvs.prx
deleting _cvs.nrm
deleting _cvs.frq
deleting _cvs.fnm
deleting _cvs.fdx
deleting _cvs.fdt
deleting _csp_h.del
deleting _csp.tis
deleting _csp.tii
deleting _csp.prx
deleting _csp.nrm
deleting _csp.frq
deleting _csp.fnm
deleting _csp.fdx
deleting _csp.fdt
deleting _cpn_q.del
deleting _cpn.tis
deleting _cpn.tii
deleting _cpn.prx
deleting _cpn.nrm
deleting _cpn.frq
deleting _cpn.fnm
deleting _cpn.fdx
deleting _cpn.fdt
deleting _cmk_x.del
deleting _cmk.tis
deleting _cmk.tii
deleting _cmk.prx
deleting _cmk.nrm
deleting _cmk.frq
deleting _cmk.fnm
deleting _cmk.fdx
deleting _cmk.fdt
deleting _cjg_14.del
deleting _cjg.tis
deleting _cjg.tii
deleting _cjg.prx
deleting _cjg.nrm
deleting _cjg.frq
deleting _cjg.fnm
deleting _cjg.fdx
deleting _cjg.fdt
deleting _cge_19.del
deleting _cge.tis
deleting _cge.tii
deleting _cge.prx
deleting _cge.nrm
deleting _cge.frq
deleting _cge.fnm
deleting _cge.fdx
deleting _cge.fdt
deleting _cd9_1m.del
deleting _cd9.tis
deleting _cd9.tii
deleting _cd9.prx
deleting _cd9.nrm
deleting _cd9.frq
deleting _cd9.fnm
deleting _cd9.fdx
deleting _cd9.fdt
./
_cww.fdt

We have random Solr slaves failing in exactly the same manner almost daily. Any help is appreciated!
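
For anyone who wants to check their own slaves for the same pattern, here is a minimal sketch of how snappuller.log could be scanned for pulls that started but never logged an "ended" line. It assumes the log uses the timestamped "started by" / "pulling snapshot" / "ended" messages quoted in this post. The function name `find_incomplete_pulls` and the rule "no `ended` line before the next `started by` means the pull did not finish" are my own assumptions, not part of the Solr scripts.

```python
import re
import sys
from datetime import datetime

# Matches lines such as "2009/05/21 23:55:02 pulling snapshot snapshot.20090521233922".
# Anything else (blank lines, "#####" annotations) is ignored.
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def find_incomplete_pulls(lines):
    """Return (start_time, snapshot_name) for every run with no 'ended' line.

    Assumption (hypothetical rule): a run is incomplete if another
    'started by' line, or the end of the input, is reached before 'ended'.
    snapshot_name is None if the run never reached 'pulling snapshot'.
    """
    incomplete = []
    current = None  # [start_time, snapshot_name] of the run being tracked
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("started by"):
            if current is not None:
                # Previous run never logged "ended".
                incomplete.append(tuple(current))
            current = [stamp, None]
        elif message.startswith("pulling snapshot ") and current is not None:
            current[1] = message.split()[-1]
        elif message.startswith("ended") and current is not None:
            current = None
    if current is not None:
        # Last run in the file has no "ended" line (hung or still running).
        incomplete.append(tuple(current))
    return incomplete


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <path to snappuller.log>
    with open(sys.argv[1]) as log_file:
        for start, snapshot in find_incomplete_pulls(log_file):
            print(start.isoformat(sep=" "), snapshot)
```

Comparing the snapshot names it reports against the times the nightly optimize finishes should show whether every incomplete pull is the first one after an optimization. Note that the last run in the file is also reported if it is simply still in progress.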