Re: Error: Unexpected jobqueue status

Karl Wright Wed, 04 Feb 2015 07:46:24 -0800

CONNECTORS-1156.

Karl



On Wed, Feb 4, 2015 at 10:33 AM, Karl Wright <[email protected]> wrote:

> Hi Adrian,
>
> Clearly, the shutdown is happening at a time when ManifoldCF is in the
> midst of an ANALYZE TABLE operation.  The shutdown is attempting to
> interrupt this operation, and is probably succeeding, but since it is
> happening when a transaction is being closed, it apparently winds up
> retrying the whole thing at that point -- which goes right back into
> ANALYZE, only to be interrupted again one second later.
>
> I still can't see why it retries, but I should be able to analyze that
> much from here.  I'll create a ticket for the problem.
>
> Karl
>
>
> On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <[email protected]>
> wrote:
>
>>  At that point, only a few (three or so) minutes.
>>
>>
>>
>> I left it for another 5 or 6 six minutes after I grabbed the stack trace
>> before I finally ‘-9’ed it.
>>
>>
>>
>> HTH,
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* 04 February 2015 15:21
>>
>> *To:* [email protected]
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> Thanks for the trace.
>>
>> Here's what I get out of it:
>>
>> The shutdown thread is waiting for all the threads to terminate:
>>
>> >>>>>>
>> "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in
>> Object.wait() [0x00007fe710116000]
>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>     at java.lang.Object.wait(Native Method)
>>     at
>> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048)
>>     - locked <0x00000000e70cf2e8> (a java.lang.Integer)
>>     at
>> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
>>     at
>> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
>>     at
>> org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
>>     - locked <0x00000000eafa3460> (a java.util.HashMap)
>>     at
>> org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
>>     at
>> org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340)
>>     - locked <0x00000000eafcb208> (a java.util.ArrayList)
>>     - locked <0x00000000eafcb2a0> (a java.lang.Integer)
>>     at
>> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565)
>> <<<<<<
>>
>> There are a lot of Zookeeper threads still alive, but those don't matter
>> here.  There is precisely one thread that is blocking shutdown:
>>
>> >>>>>>
>> "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in
>> Object.wait() [0x00007fe71f7f6000]
>>    java.lang.Thread.State: WAITING (on object monitor)
>>     at java.lang.Object.wait(Native Method)
>>     at java.lang.Thread.join(Thread.java:1281)
>>     - locked <0x00000000e851a0b8> (a
>> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread)
>>     at java.lang.Thread.join(Thread.java:1355)
>>     at
>> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694)
>>     at
>> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>     at
>> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790)
>>     at
>> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>     at
>> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>     at
>> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576)
>>     at
>> org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429)
>>     at
>> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231)
>>     at
>> org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575)
>>     at
>> org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238)
>> <<<<<<
>>
>> This is performing a database ANALYZE TABLE operation, which *should* be
>> interruptible, given the trace, but is apparently *not*.  I'll have to look
>> at this to find a reason why it won't interrupt.  Offhand, I can see no
>> reason for it.
>>
>> When you got this thread dump, how long had the system been waiting to
>> shut down?
>>
>> Karl
>>
>>
>>
>> On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <[email protected]>
>> wrote:
>>
>>  Hi Karl,
>>
>>
>>
>> It’s taken me a little while before I needed to do some work on the
>> server, but here’s a ‘jstack’ dump of an agents run that’s not responding
>> to a stop request.
>>
>>
>>
>> Hope it’s helpful,
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Adrian Conlon [mailto:[email protected]]
>> *Sent:* 12 January 2015 15:35
>> *To:* [email protected]
>> *Subject:* RE: Error: Unexpected jobqueue status
>>
>>
>>
>> Thanks Karl.  Restarting the job manually did fix the problem.
>>
>>
>>
>> I might add a check for this in my software and kick the job into life
>> again automatically, now I know it works…
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:[email protected] <[email protected]>]
>> *Sent:* 12 January 2015 11:34
>> *To:* [email protected]
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> Hi Adrian,
>>
>> Just restarting the job should be sufficient to get it sorted out after
>> this kind of failure.
>>
>> Karl
>>
>>
>>
>> On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <[email protected]>
>> wrote:
>>
>>  Thanks Karl.
>>
>>
>>
>> With regards the runtime environment, apologies for the omission of
>> Postgresql version.  It’s v9.3.
>>
>>
>>
>> For the stack trace, I’ve just installed a jdk on the problematic server
>> and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the
>> next time agents process doesn’t respond to a stop request.
>>
>>
>>
>> With regards jobs that have this unexpected “jobqueue” status; do they
>> sort themselves out the next time the job runs?  Is there anything I should
>> do to “help” the job along?
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* 12 January 2015 00:27
>> *To:* [email protected]
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> Also, if you are having trouble shutting down the agents process, it
>> would be great if you could get a thread dump and post it, before you kill
>> -9 it.
>>
>> Karl
>>
>>
>>
>> On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <[email protected]> wrote:
>>
>>  Hi Adrian,
>>
>> If you noted the comment stream in CONNECTORS-590, I was able to
>> demonstrate conclusively that the problem was in Postgres.  I have not seen
>> the problem in 9.3, but that does not mean it's gone.  What version of
>> Postgresql are you using?
>>
>> In any case, while this problem definitely terminates your job, it will
>> not happen very often.  I suspect the frequency of occurrence may depend on
>> how loaded the database is.
>>
>> Karl
>>
>>
>>
>> On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <[email protected]>
>> wrote:
>>
>>  Hi All,
>>
>>
>>
>> I’m getting an occurrence of what looks very similar to CONNECTORS-590.
>>
>>
>>
>> The circumstances are:
>>
>>
>>
>> 1)      MCF Jobs proceeding very slowly (looks like a Postgresql vacuum
>> is needed)
>>
>> 2)      Stop tomcat
>>
>> 3)      Attempt to stop the agents normally
>>
>> 4)      Wait a minute or two
>>
>> 5)      Decide to “kill -9” the agents process
>>
>> 6)      Vacuum the database
>>
>> 7)      Restart tomcat
>>
>> 8)      Restart the agents
>>
>>
>>
>> When I checked the job status page, I found that two of the jobs (out
>> around 4000 or so) had the following status (or very similar):
>>
>>
>>
>> Error: Unexpected jobqueue status - record id 1417115392831, expecting
>> active status, saw 4
>>
>>
>>
>> Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2),
>> using postresql as the crawl database and running on Ubuntu Linux.  I’m
>> using zookeeper style synchronisation.
>>
>>
>>
>> Let me know if more information etc. is needed or if you think it’s a
>> new/real issue.
>>
>> Adrian
>>
>> ____________________________________________________________
>> Electronic mail messages entering and leaving Arup  business
>> systems are scanned for acceptability of content and viruses
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Re: Error: Unexpected jobqueue status

Reply via email to