CONNECTORS-1156. Karl
On Wed, Feb 4, 2015 at 10:33 AM, Karl Wright <[email protected]> wrote: > Hi Adrian, > > Clearly, the shutdown is happening at a time when ManifoldCF is in the > midst of an ANALYZE TABLE operation. The shutdown is attempting to > interrupt this operation, and is probably succeeding, but since it is > happening when a transaction is being closed, it apparently winds up > retrying the whole thing at that point -- which goes right back into > ANALYZE, only to be interrupted again one second later. > > I still can't see why it retries, but I should be able to analyze that > much from here. I'll create a ticket for the problem. > > Karl > > > On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <[email protected]> > wrote: > >> At that point, only a few (three or so) minutes. >> >> >> >> I left it for another 5 or 6 six minutes after I grabbed the stack trace >> before I finally ‘-9’ed it. >> >> >> >> HTH, >> >> >> >> Adrian >> >> >> >> *From:* Karl Wright [mailto:[email protected]] >> *Sent:* 04 February 2015 15:21 >> >> *To:* [email protected] >> *Subject:* Re: Error: Unexpected jobqueue status >> >> >> >> Thanks for the trace. >> >> Here's what I get out of it: >> >> The shutdown thread is waiting for all the threads to terminate: >> >> >>>>>> >> "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in >> Object.wait() [0x00007fe710116000] >> java.lang.Thread.State: TIMED_WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at >> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048) >> - locked <0x00000000e70cf2e8> (a java.lang.Integer) >> at >> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617) >> at >> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249) >> at >> org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168) >> - locked <0x00000000eafa3460> (a java.util.HashMap) >> at >> org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395) >> at >> org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340) >> - locked <0x00000000eafcb208> (a java.util.ArrayList) >> - locked <0x00000000eafcb2a0> (a java.lang.Integer) >> at >> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565) >> <<<<<< >> >> There are a lot of Zookeeper threads still alive, but those don't matter >> here. There is precisely one thread that is blocking shutdown: >> >> >>>>>> >> "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in >> Object.wait() [0x00007fe71f7f6000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Thread.join(Thread.java:1281) >> - locked <0x00000000e851a0b8> (a >> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread) >> at java.lang.Thread.join(Thread.java:1355) >> at >> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694) >> at >> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728) >> at >> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790) >> at >> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444) >> at >> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146) >> at >> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191) >> at >> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656) >> at >> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431) >> at >> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576) >> at >> org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429) >> at >> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414) >> at >> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231) >> at >> org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575) >> at >> org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238) >> <<<<<< >> >> This is performing a database ANALYZE TABLE operation, which *should* be >> interruptible, given the trace, but is apparently *not*. I'll have to look >> at this to find a reason why it won't interrupt. Offhand, I can see no >> reason for it. >> >> When you got this thread dump, how long had the system been waiting to >> shut down? >> >> Karl >> >> >> >> On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <[email protected]> >> wrote: >> >> Hi Karl, >> >> >> >> It’s taken me a little while before I needed to do some work on the >> server, but here’s a ‘jstack’ dump of an agents run that’s not responding >> to a stop request. >> >> >> >> Hope it’s helpful, >> >> >> >> Adrian >> >> >> >> *From:* Adrian Conlon [mailto:[email protected]] >> *Sent:* 12 January 2015 15:35 >> *To:* [email protected] >> *Subject:* RE: Error: Unexpected jobqueue status >> >> >> >> Thanks Karl. Restarting the job manually did fix the problem. >> >> >> >> I might add a check for this in my software and kick the job into life >> again automatically, now I know it works… >> >> >> >> Adrian >> >> >> >> *From:* Karl Wright [mailto:[email protected] <[email protected]>] >> *Sent:* 12 January 2015 11:34 >> *To:* [email protected] >> *Subject:* Re: Error: Unexpected jobqueue status >> >> >> >> Hi Adrian, >> >> Just restarting the job should be sufficient to get it sorted out after >> this kind of failure. >> >> Karl >> >> >> >> On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <[email protected]> >> wrote: >> >> Thanks Karl. >> >> >> >> With regards the runtime environment, apologies for the omission of >> Postgresql version. It’s v9.3. >> >> >> >> For the stack trace, I’ve just installed a jdk on the problematic server >> and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the >> next time agents process doesn’t respond to a stop request. >> >> >> >> With regards jobs that have this unexpected “jobqueue” status; do they >> sort themselves out the next time the job runs? Is there anything I should >> do to “help” the job along? >> >> >> >> Adrian >> >> >> >> *From:* Karl Wright [mailto:[email protected]] >> *Sent:* 12 January 2015 00:27 >> *To:* [email protected] >> *Subject:* Re: Error: Unexpected jobqueue status >> >> >> >> Also, if you are having trouble shutting down the agents process, it >> would be great if you could get a thread dump and post it, before you kill >> -9 it. >> >> Karl >> >> >> >> On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <[email protected]> wrote: >> >> Hi Adrian, >> >> If you noted the comment stream in CONNECTORS-590, I was able to >> demonstrate conclusively that the problem was in Postgres. I have not seen >> the problem in 9.3, but that does not mean it's gone. What version of >> Postgresql are you using? >> >> In any case, while this problem definitely terminates your job, it will >> not happen very often. I suspect the frequency of occurrence may depend on >> how loaded the database is. >> >> Karl >> >> >> >> On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <[email protected]> >> wrote: >> >> Hi All, >> >> >> >> I’m getting an occurrence of what looks very similar to CONNECTORS-590. >> >> >> >> The circumstances are: >> >> >> >> 1) MCF Jobs proceeding very slowly (looks like a Postgresql vacuum >> is needed) >> >> 2) Stop tomcat >> >> 3) Attempt to stop the agents normally >> >> 4) Wait a minute or two >> >> 5) Decide to “kill -9” the agents process >> >> 6) Vacuum the database >> >> 7) Restart tomcat >> >> 8) Restart the agents >> >> >> >> When I checked the job status page, I found that two of the jobs (out >> around 4000 or so) had the following status (or very similar): >> >> >> >> Error: Unexpected jobqueue status - record id 1417115392831, expecting >> active status, saw 4 >> >> >> >> Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2), >> using postresql as the crawl database and running on Ubuntu Linux. I’m >> using zookeeper style synchronisation. >> >> >> >> Let me know if more information etc. is needed or if you think it’s a >> new/real issue. >> >> Adrian >> >> ____________________________________________________________ >> Electronic mail messages entering and leaving Arup business >> systems are scanned for acceptability of content and viruses >> >> >> >> >> >> >> >> >> > >
