Hi Adrian, Clearly, the shutdown is happening at a time when ManifoldCF is in the midst of an ANALYZE TABLE operation. The shutdown is attempting to interrupt this operation, and is probably succeeding, but since it is happening when a transaction is being closed, it apparently winds up retrying the whole thing at that point -- which goes right back into ANALYZE, only to be interrupted again one second later.
I still can't see why it retries, but I should be able to analyze that much from here. I'll create a ticket for the problem. Karl On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <[email protected]> wrote: > At that point, only a few (three or so) minutes. > > > > I left it for another 5 or 6 six minutes after I grabbed the stack trace > before I finally ‘-9’ed it. > > > > HTH, > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 04 February 2015 15:21 > > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Thanks for the trace. > > Here's what I get out of it: > > The shutdown thread is waiting for all the threads to terminate: > > >>>>>> > "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in > Object.wait() [0x00007fe710116000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048) > - locked <0x00000000e70cf2e8> (a java.lang.Integer) > at > org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617) > at > org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249) > at > org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168) > - locked <0x00000000eafa3460> (a java.util.HashMap) > at > org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395) > at > org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340) > - locked <0x00000000eafcb208> (a java.util.ArrayList) > - locked <0x00000000eafcb2a0> (a java.lang.Integer) > at > org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565) > <<<<<< > > There are a lot of Zookeeper threads still alive, but those don't matter > here. There is precisely one thread that is blocking shutdown: > > >>>>>> > "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in > Object.wait() [0x00007fe71f7f6000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x00000000e851a0b8> (a > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread) > at java.lang.Thread.join(Thread.java:1355) > at > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694) > at > org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728) > at > org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790) > at > org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444) > at > org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146) > at > org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576) > at > org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429) > at > org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231) > at > org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575) > at > org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238) > <<<<<< > > This is performing a database ANALYZE TABLE operation, which *should* be > interruptible, given the trace, but is apparently *not*. I'll have to look > at this to find a reason why it won't interrupt. Offhand, I can see no > reason for it. > > When you got this thread dump, how long had the system been waiting to > shut down? > > Karl > > > > On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <[email protected]> > wrote: > > Hi Karl, > > > > It’s taken me a little while before I needed to do some work on the > server, but here’s a ‘jstack’ dump of an agents run that’s not responding > to a stop request. > > > > Hope it’s helpful, > > > > Adrian > > > > *From:* Adrian Conlon [mailto:[email protected]] > *Sent:* 12 January 2015 15:35 > *To:* [email protected] > *Subject:* RE: Error: Unexpected jobqueue status > > > > Thanks Karl. Restarting the job manually did fix the problem. > > > > I might add a check for this in my software and kick the job into life > again automatically, now I know it works… > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected] <[email protected]>] > *Sent:* 12 January 2015 11:34 > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Hi Adrian, > > Just restarting the job should be sufficient to get it sorted out after > this kind of failure. > > Karl > > > > On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <[email protected]> > wrote: > > Thanks Karl. > > > > With regards the runtime environment, apologies for the omission of > Postgresql version. It’s v9.3. > > > > For the stack trace, I’ve just installed a jdk on the problematic server > and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the > next time agents process doesn’t respond to a stop request. > > > > With regards jobs that have this unexpected “jobqueue” status; do they > sort themselves out the next time the job runs? Is there anything I should > do to “help” the job along? > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 12 January 2015 00:27 > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Also, if you are having trouble shutting down the agents process, it would > be great if you could get a thread dump and post it, before you kill -9 it. > > Karl > > > > On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <[email protected]> wrote: > > Hi Adrian, > > If you noted the comment stream in CONNECTORS-590, I was able to > demonstrate conclusively that the problem was in Postgres. I have not seen > the problem in 9.3, but that does not mean it's gone. What version of > Postgresql are you using? > > In any case, while this problem definitely terminates your job, it will > not happen very often. I suspect the frequency of occurrence may depend on > how loaded the database is. > > Karl > > > > On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <[email protected]> > wrote: > > Hi All, > > > > I’m getting an occurrence of what looks very similar to CONNECTORS-590. > > > > The circumstances are: > > > > 1) MCF Jobs proceeding very slowly (looks like a Postgresql vacuum > is needed) > > 2) Stop tomcat > > 3) Attempt to stop the agents normally > > 4) Wait a minute or two > > 5) Decide to “kill -9” the agents process > > 6) Vacuum the database > > 7) Restart tomcat > > 8) Restart the agents > > > > When I checked the job status page, I found that two of the jobs (out > around 4000 or so) had the following status (or very similar): > > > > Error: Unexpected jobqueue status - record id 1417115392831, expecting > active status, saw 4 > > > > Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2), > using postresql as the crawl database and running on Ubuntu Linux. I’m > using zookeeper style synchronisation. > > > > Let me know if more information etc. is needed or if you think it’s a > new/real issue. > > Adrian > > ____________________________________________________________ > Electronic mail messages entering and leaving Arup business > systems are scanned for acceptability of content and viruses > > > > > > > > >
