Is it conceivable that there's too much traffic, causing Solr to stall re-opening the searcher (and thus delaying the switch to the new index)? I'm grasping at straws, and this is beginning to bug me a lot. The traffic logs wouldn't seem to support this (apart from periodic health-check pings, the load is distributed fairly evenly across the 3 slaves by a load balancer). After 35+ minutes this morning, none of the three had successfully "unstuck", and all had to have their cores manually reloaded.
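For what it's worth, the searcher- and replication-related bits of the slave solrconfig.xml look roughly like the stock 4.x example (I'm paraphrasing from memory here, so treat this as a sketch; the master URL and poll interval below are placeholders, not our real values):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/core</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>

    <query>
      <!-- warming settings; my guess is that a hung warm-up could keep the
           new searcher from ever being registered, but I haven't confirmed that -->
      <useColdSearcher>false</useColdSearcher>
      <maxWarmingSearchers>2</maxWarmingSearchers>
    </query>

If heavy autowarming (or warming queries on a newSearcher event) could explain the stall, I can try trimming those down, but I don't know yet whether that's actually what is happening.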
Is there perhaps a configuration element I'm overlooking that would make Solr a bit less "friendly" about it, and just dump the old searchers and reopen as soon as replication completes?

As a side note, I'm getting really frustrated trying to get log4j logging set up on 4.3.1: my Tomcat container keeps complaining that it cannot find log4j.properties, even though I've put it in WEB-INF/classes of the war file, have the SLF4J libraries AND log4j at the shared container "lib" level, and have log4j.debug turned on. I can't see any reason why it fails to locate the configuration. (A stripped-down log4j.properties along the lines of what I've been trying is pasted at the very bottom of this mail, below the quoted thread.)

Any suggestions or pointers would be greatly appreciated. Thanks!

On Thu, Jun 27, 2013 at 10:35 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Odd - looks like it's stuck waiting to be notified that a new searcher is
> ready.
>
> - Mark
>
> On Jun 27, 2013, at 8:58 AM, Neal Ensor <nen...@gmail.com> wrote:
>
> > Okay, I have done this (updated to 4.3.1 across master and four slaves;
> > one of these is my own PC for experiments, and it is not being accessed
> > by clients).
> >
> > Just had a minor replication this morning, and all three slaves are
> > "stuck" again. Replication supposedly started at 8:40 and ended 30
> > seconds or so later (on my local PC, which is set up identically to the
> > other three slaves). The three slaves will NOT complete the roll-over
> > to the new index. All three index folders have a write.lock, and the
> > latest files are dated 8:40am (it is now 8:54am, with no further
> > activity in the index folders). There is an "index.20130627084000061"
> > (or some variation thereof) in all three slaves' data folders.
> >
> > The seemingly relevant thread dump of a "snappuller" thread on each of
> > these slaves:
> >
> > - sun.misc.Unsafe.park(Native Method)
> > - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> > - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> > - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> > - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> > - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
> > - java.util.concurrent.FutureTask.get(FutureTask.java:83)
> > - org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
> > - org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
> > - org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
> > - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
> > - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> > - java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> > - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> > - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> > - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> > - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> > - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> > - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> > - java.lang.Thread.run(Thread.java:662)
> >
> >
> > Here they sit. My local PC "slave" replicated very quickly and switched
> > over to the new generation (206) immediately. I am not sure why the
> > three slaves are dragging on this. If there are any configuration
> > elements or other details you need, please let me know. I can manually
> > "kick" them by reloading the core from the admin pages, but obviously I
> > would like this to be a hands-off process. Any help is greatly
> > appreciated; this has been bugging me for some time now.
> >
> >
> > On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar
> > <shalinman...@gmail.com> wrote:
> >
> >> A bunch of replication-related issues were fixed in 4.2.1, so you're
> >> better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
> >>
> >> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor <nen...@gmail.com> wrote:
> >>> As a bit of background, we run a setup (having moved from 3.6.1 to
> >>> 4.2 relatively recently) with a single master receiving updates and
> >>> three slaves pulling changes in. Our index is around 5 million
> >>> documents, around 26GB in total size.
> >>>
> >>> The situation I'm seeing is this: occasionally we update the master,
> >>> and replication begins on the three slaves and seems to proceed
> >>> normally until it hits the end. At that point it "sticks"; there are
> >>> no messages in the logs, and nothing on the admin page seems to be
> >>> happening. I sit there for sometimes upwards of 30 minutes, seeing
> >>> no further activity in the index folder(s). After a while, I go to
> >>> the core admin page and manually reload the core, which "catches it
> >>> up". It seems like the index readers/writers are not releasing the
> >>> index otherwise? The configuration is set to reopen; very
> >>> occasionally this situation actually fixes itself after a longish
> >>> period of time, but it is very annoying.
> >>>
> >>> I had at first suspected this was due to our underlying shared (SAN)
> >>> storage, so we installed SSDs in all three slave machines and moved
> >>> the entire indexes to those. It did not seem to affect this issue at
> >>> all (additionally, I didn't really see the expected performance
> >>> boost, but that's a separate issue entirely).
> >>>
> >>> Any ideas? Any configuration details I might share/reconfigure? Any
> >>> suggestions are appreciated. I could also upgrade to the later 4.3+
> >>> versions, if that might help.
> >>>
> >>> Thanks!
> >>>
> >>> Neal Ensor
> >>> nen...@gmail.com
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
> >
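P.S. Regarding the log4j issue mentioned above, this is the sort of stripped-down log4j.properties I've been trying in WEB-INF/classes (illustrative only; the file path, levels and pattern are just an example, not our real settings):

    # log to a rolling file under the Tomcat working directory
    log4j.rootLogger=INFO, file
    log4j.appender.file=org.apache.log4j.RollingFileAppender
    log4j.appender.file.File=logs/solr.log
    log4j.appender.file.MaxFileSize=10MB
    log4j.appender.file.MaxBackupIndex=9
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p (%t) [%c] %m%n
    # quiet down chattier Solr internals if needed
    log4j.logger.org.apache.solr=INFO

If placement inside the war is the problem, I may also try pointing log4j at the file explicitly with -Dlog4j.configuration=file:/path/to/log4j.properties (path is a placeholder) as a Tomcat JVM option.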