RE: Error: Unexpected jobqueue status

Adrian Conlon Thu, 05 Feb 2015 09:45:27 -0800

Hi Karl,

I’ve applied the logging patch from #1156 and uploaded onto my running server.

I presume I need to wait until the problem would recur before the effects of 
the patch would be seen.  Is that correct?

Also, for ease of deployment (I was just being lazy really!), I only uploaded 
mcf-core.jar and mcf-agents.jar.  To confirm, that would include your patch, 
wouldn’t it?

Thanks,

Adrian

From: Adrian Conlon [mailto:[email protected]]
Sent: 04 February 2015 17:24
To: [email protected]
Subject: RE: Error: Unexpected jobqueue status

Thanks Karl,

I’ll build/deploy tomorrow and keep you posted.

Adrian

From: Karl Wright [mailto:[email protected]]
Sent: 04 February 2015 17:14
To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

Hi Adrian,

I've attached the patch to the CONNECTORS-1156 ticket.

Thanks!
Karl

On Wed, Feb 4, 2015 at 12:05 PM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
I’ve built MCF from source in the past with patches applied, so I’m happy to do 
so again.

Let me know when you’ve got a patch for me to try…

Thanks,

Adrian

From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 04 February 2015 16:59

To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

Hi Adrian,

That's fine.  However, if you have a version of MCF you've built yourself, 
maybe we could add instrumentation to the code which would help resolve exactly 
what's happening.  Are you open to this?

Karl

On Wed, Feb 4, 2015 at 11:27 AM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
Oh, sorry Karl: I feel like an idiot.  The cleanup process I run removes any 
outstanding log files for MCF (amongst other things), so those files are gone.

It usually takes a couple of weeks for the agents to get into a state that 
causes these hangs (I think it’s when the database needs vacuuming), so the 
best I can offer is to try again in a couple of weeks.

Adrian

From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 04 February 2015 16:09

To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

It's also possible that there are no log messages from the time of the shutdown 
attempt -- that too would be good to know.  That would mean that the Postgresql 
driver is interpreting interruption with a specific SQLException code.  In that 
case we may need to instrument the code in order to determine what the 
PostgreSQL driver is up to, and it may be difficult to distinguish between 
thread interruption and a legitimate SQL error.

Karl

On Wed, Feb 4, 2015 at 10:53 AM, Karl Wright 
<[email protected]<mailto:[email protected]>> wrote:
Hi Adrian,

There should be log messages corresponding to this line:

>>>>>>
          // Log it, but keep the thread alive
          Logging.threads.error("Exception tossed: "+e.getMessage(),e);
<<<<<<
Can you find any of those, and if so, please send me the logged exception?

Thanks!
Karl

On Wed, Feb 4, 2015 at 10:50 AM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Karl, I’m glad I found you an interesting one!

Reasonably fast agent shutdown would definitely be useful for scheduled DB 
maintenance tasks…

Adrian

From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 04 February 2015 15:38

To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

CONNECTORS-1156.

Karl

On Wed, Feb 4, 2015 at 10:33 AM, Karl Wright 
<[email protected]<mailto:[email protected]>> wrote:
Hi Adrian,
Clearly, the shutdown is happening at a time when ManifoldCF is in the midst of 
an ANALYZE TABLE operation.  The shutdown is attempting to interrupt this 
operation, and is probably succeeding, but since it is happening when a 
transaction is being closed, it apparently winds up retrying the whole thing at 
that point -- which goes right back into ANALYZE, only to be interrupted again 
one second later.
I still can't see why it retries, but I should be able to analyze that much 
from here.  I'll create a ticket for the problem.
Karl

On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
At that point, only a few (three or so) minutes.

I left it for another 5 or 6 six minutes after I grabbed the stack trace before 
I finally ‘-9’ed it.

HTH,

Adrian

From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 04 February 2015 15:21

To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

Thanks for the trace.
Here's what I get out of it:
The shutdown thread is waiting for all the threads to terminate:

>>>>>>
"Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in Object.wait() 
[0x00007fe710116000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048)
    - locked <0x00000000e70cf2e8> (a java.lang.Integer)
    at 
org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
    at 
org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
    at 
org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
    - locked <0x00000000eafa3460> (a java.util.HashMap)
    at 
org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
    at 
org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340)
    - locked <0x00000000eafcb208> (a java.util.ArrayList)
    - locked <0x00000000eafcb2a0> (a java.lang.Integer)
    at 
org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565)
<<<<<<
There are a lot of Zookeeper threads still alive, but those don't matter here.  
There is precisely one thread that is blocking shutdown:

>>>>>>
"Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in 
Object.wait() [0x00007fe71f7f6000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    - locked <0x00000000e851a0b8> (a 
org.apache.manifoldcf.core.database.Database$ExecuteQueryThread)
    at java.lang.Thread.join(Thread.java:1355)
    at 
org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694)
    at 
org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
    at 
org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790)
    at 
org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
    at 
org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
    at 
org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
    at 
org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656)
    at 
org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431)
    at 
org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576)
    at 
org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429)
    at 
org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414)
    at 
org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231)
    at 
org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575)
    at 
org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238)
<<<<<<
This is performing a database ANALYZE TABLE operation, which *should* be 
interruptible, given the trace, but is apparently *not*.  I'll have to look at 
this to find a reason why it won't interrupt.  Offhand, I can see no reason for 
it.

When you got this thread dump, how long had the system been waiting to shut 
down?

Karl

On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
Hi Karl,

It’s taken me a little while before I needed to do some work on the server, but 
here’s a ‘jstack’ dump of an agents run that’s not responding to a stop request.

Hope it’s helpful,

Adrian

From: Adrian Conlon 
[mailto:[email protected]<mailto:[email protected]>]
Sent: 12 January 2015 15:35
To: [email protected]<mailto:[email protected]>
Subject: RE: Error: Unexpected jobqueue status

Thanks Karl.  Restarting the job manually did fix the problem.

I might add a check for this in my software and kick the job into life again 
automatically, now I know it works…

Adrian

From: Karl Wright [mailto:[email protected]]
Sent: 12 January 2015 11:34
To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

Hi Adrian,
Just restarting the job should be sufficient to get it sorted out after this 
kind of failure.

Karl

On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Karl.

With regards the runtime environment, apologies for the omission of Postgresql 
version.  It’s v9.3.

For the stack trace, I’ve just installed a jdk on the problematic server and 
tried out “jstack” (that’s a neat tool!), so I’m all systems go for the next 
time agents process doesn’t respond to a stop request.

With regards jobs that have this unexpected “jobqueue” status; do they sort 
themselves out the next time the job runs?  Is there anything I should do to 
“help” the job along?

Adrian

From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 12 January 2015 00:27
To: [email protected]<mailto:[email protected]>
Subject: Re: Error: Unexpected jobqueue status

Also, if you are having trouble shutting down the agents process, it would be 
great if you could get a thread dump and post it, before you kill -9 it.

Karl

On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright 
<[email protected]<mailto:[email protected]>> wrote:
Hi Adrian,
If you noted the comment stream in CONNECTORS-590, I was able to demonstrate 
conclusively that the problem was in Postgres.  I have not seen the problem in 
9.3, but that does not mean it's gone.  What version of Postgresql are you 
using?

In any case, while this problem definitely terminates your job, it will not 
happen very often.  I suspect the frequency of occurrence may depend on how 
loaded the database is.

Karl

On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon 
<[email protected]<mailto:[email protected]>> wrote:
Hi All,

I’m getting an occurrence of what looks very similar to CONNECTORS-590.

The circumstances are:

1)      MCF Jobs proceeding very slowly (looks like a Postgresql vacuum is 
needed)

2)      Stop tomcat

3)      Attempt to stop the agents normally

4)      Wait a minute or two

5)      Decide to “kill -9” the agents process

6)      Vacuum the database

7)      Restart tomcat

8)      Restart the agents

When I checked the job status page, I found that two of the jobs (out around 
4000 or so) had the following status (or very similar):

Error: Unexpected jobqueue status - record id 1417115392831, expecting active 
status, saw 4

Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2), using 
postresql as the crawl database and running on Ubuntu Linux.  I’m using 
zookeeper style synchronisation.

Let me know if more information etc. is needed or if you think it’s a new/real 
issue.
Adrian

____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses

RE: Error: Unexpected jobqueue status

Reply via email to