Re: Error: Unexpected jobqueue status

Karl Wright Wed, 04 Feb 2015 07:35:14 -0800

Hi Adrian,

Clearly, the shutdown is happening at a time when ManifoldCF is in the
midst of an ANALYZE TABLE operation.  The shutdown is attempting to
interrupt this operation, and is probably succeeding, but since it is
happening when a transaction is being closed, it apparently winds up
retrying the whole thing at that point -- which goes right back into
ANALYZE, only to be interrupted again one second later.
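
To make the suspected loop concrete, here is a minimal Java sketch of the pattern described above. It is not ManifoldCF code: the class name, the method names, and the use of Thread.sleep as a stand-in for the ANALYZE call are all assumptions made for illustration. A "shutdown" loop interrupts a worker every 50 ms; the worker either retries the operation after each interrupt or honors the interrupt and exits.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InterruptRetrySketch {

    // Hypothetical stand-in for a long database call such as ANALYZE.
    static void longOperation() throws InterruptedException {
        Thread.sleep(60_000); // blocks for a minute unless interrupted
    }

    // Runs the operation on a worker thread while a "shutdown" loop interrupts
    // it every 50 ms. Returns how many times the operation was started.
    // maxAttempts only exists so that the retrying variant terminates.
    static int attemptsUntilStopped(boolean retryOnInterrupt, int maxAttempts) {
        AtomicInteger attempts = new AtomicInteger();
        Thread worker = new Thread(() -> {
            while (attempts.get() < maxAttempts) {
                attempts.incrementAndGet();
                try {
                    longOperation();
                    return; // finished normally
                } catch (InterruptedException e) {
                    if (!retryOnInterrupt) {
                        return; // honor the interrupt so shutdown can proceed
                    }
                    // Otherwise fall through and start the operation again.
                }
            }
        });
        worker.start();
        while (worker.isAlive()) {
            worker.interrupt();
            try {
                worker.join(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println("retrying: " + attemptsUntilStopped(true, 5));
        System.out.println("honoring: " + attemptsUntilStopped(false, 5));
    }
}
```

In this sketch the retrying variant starts the operation 5 times (once per allowed attempt) before the worker exits, while the variant that honors the interrupt starts it only once. Without the attempt cap, the retrying variant would never let the shutdown loop finish, which matches the hang seen here.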


I still can't see why it retries, but I should be able to analyze that much
from here.  I'll create a ticket for the problem.

Karl


On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <[email protected]>
wrote:

>  At that point, only a few (three or so) minutes.
>
>
>
> I left it for another 5 or 6 minutes after I grabbed the stack trace
> before I finally ‘-9’ed it.
>
>
>
> HTH,
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 04 February 2015 15:21
>
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Thanks for the trace.
>
> Here's what I get out of it:
>
> The shutdown thread is waiting for all the threads to terminate:
>
> >>>>>>
> "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in
> Object.wait() [0x00007fe710116000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     at
> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048)
>     - locked <0x00000000e70cf2e8> (a java.lang.Integer)
>     at
> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
>     at
> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
>     at
> org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
>     - locked <0x00000000eafa3460> (a java.util.HashMap)
>     at
> org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
>     at
> org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340)
>     - locked <0x00000000eafcb208> (a java.util.ArrayList)
>     - locked <0x00000000eafcb2a0> (a java.lang.Integer)
>     at
> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565)
> <<<<<<
>
> There are a lot of Zookeeper threads still alive, but those don't matter
> here.  There is precisely one thread that is blocking shutdown:
>
> >>>>>>
> "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in
> Object.wait() [0x00007fe71f7f6000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     at java.lang.Thread.join(Thread.java:1281)
>     - locked <0x00000000e851a0b8> (a
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread)
>     at java.lang.Thread.join(Thread.java:1355)
>     at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694)
>     at
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>     at
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790)
>     at
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>     at
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>     at
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576)
>     at
> org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429)
>     at
> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231)
>     at
> org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575)
>     at
> org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238)
> <<<<<<
>
> This is performing a database ANALYZE TABLE operation, which *should* be
> interruptible, given the trace, but is apparently *not*.  I'll have to look
> at this to find a reason why it won't interrupt.  Offhand, I can see no
> reason for it.
>
> When you got this thread dump, how long had the system been waiting to
> shut down?
>
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  Hi Karl,
>
>
>
> It’s taken me a little while before I needed to do some work on the
> server, but here’s a ‘jstack’ dump of an agents run that’s not responding
> to a stop request.
>
>
>
> Hope it’s helpful,
>
>
>
> Adrian
>
>
>
> *From:* Adrian Conlon [mailto:[email protected]]
> *Sent:* 12 January 2015 15:35
> *To:* [email protected]
> *Subject:* RE: Error: Unexpected jobqueue status
>
>
>
> Thanks Karl.  Restarting the job manually did fix the problem.
>
>
>
> I might add a check for this in my software and kick the job into life
> again automatically, now I know it works…
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 12 January 2015 11:34
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Hi Adrian,
>
> Just restarting the job should be sufficient to get it sorted out after
> this kind of failure.
>
> Karl
>
>
>
> On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  Thanks Karl.
>
>
>
> With regard to the runtime environment, apologies for omitting the
> PostgreSQL version.  It’s v9.3.
>
>
>
> For the stack trace, I’ve just installed a JDK on the problematic server
> and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the
> next time the agents process doesn’t respond to a stop request.
>
>
>
> With regard to jobs that have this unexpected “jobqueue” status: do they
> sort themselves out the next time the job runs?  Is there anything I should
> do to “help” the job along?
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 12 January 2015 00:27
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Also, if you are having trouble shutting down the agents process, it would
> be great if you could get a thread dump and post it, before you kill -9 it.
>
> Karl
>
>
>
> On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <[email protected]> wrote:
>
>  Hi Adrian,
>
> If you noted the comment stream in CONNECTORS-590, I was able to
> demonstrate conclusively that the problem was in Postgres.  I have not seen
> the problem in 9.3, but that does not mean it's gone.  What version of
> Postgresql are you using?
>
> In any case, while this problem definitely terminates your job, it will
> not happen very often.  I suspect the frequency of occurrence may depend on
> how loaded the database is.
>
> Karl
>
>
>
> On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <[email protected]>
> wrote:
>
>  Hi All,
>
>
>
> I’m getting an occurrence of what looks very similar to CONNECTORS-590.
>
>
>
> The circumstances are:
>
>
>
> 1)      MCF Jobs proceeding very slowly (looks like a Postgresql vacuum
> is needed)
>
> 2)      Stop tomcat
>
> 3)      Attempt to stop the agents normally
>
> 4)      Wait a minute or two
>
> 5)      Decide to “kill -9” the agents process
>
> 6)      Vacuum the database
>
> 7)      Restart tomcat
>
> 8)      Restart the agents
>
>
>
> When I checked the job status page, I found that two of the jobs (out
> of around 4000 or so) had the following status (or very similar):
>
>
>
> Error: Unexpected jobqueue status - record id 1417115392831, expecting
> active status, saw 4
>
>
>
> Setup-wise, I’m running a release candidate of v1.8 (I think RC2),
> using PostgreSQL as the crawl database and running on Ubuntu Linux.  I’m
> using ZooKeeper-style synchronisation.
>
>
>
> Let me know if more information etc. is needed or if you think it’s a
> new/real issue.
>
> Adrian
>
