Re: Error: Unexpected jobqueue status

Karl Wright Thu, 05 Feb 2015 10:25:18 -0800

Hi Adrian,

The patch is in mcf-core, so that should be fine.


Karl


On Thu, Feb 5, 2015 at 12:42 PM, Adrian Conlon <[email protected]>
wrote:

>  Hi Karl,
>
>
>
> I’ve applied the logging patch from #1156 and uploaded onto my running
> server.
>
>
>
> I presume I need to wait until the problem would recur before the effects
> of the patch would be seen.  Is that correct?
>
>
>
> Also, for ease of deployment (I was just being lazy really!), I only
> uploaded mcf-core.jar and mcf-agents.jar.  To confirm, that would include
> your patch, wouldn’t it?
>
>
>
> Thanks,
>
>
>
> Adrian
>
>
>
> *From:* Adrian Conlon [mailto:[email protected]]
> *Sent:* 04 February 2015 17:24
>
> *To:* [email protected]
> *Subject:* RE: Error: Unexpected jobqueue status
>
>
>
> Thanks Karl,
>
>
>
> I’ll build/deploy tomorrow and keep you posted.
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected] <[email protected]>]
> *Sent:* 04 February 2015 17:14
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Hi Adrian,
>
> I've attached the patch to the CONNECTORS-1156 ticket.
>
> Thanks!
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 12:05 PM, Adrian Conlon <[email protected]>
> wrote:
>
>  I’ve built MCF from source in the past with patches applied, so I’m
> happy to do so again.
>
>
>
> Let me know when you’ve got a patch for me to try…
>
>
>
> Thanks,
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 04 February 2015 16:59
>
>
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Hi Adrian,
>
> That's fine.  However, if you have a version of MCF you've built yourself,
> maybe we could add instrumentation to the code which would help resolve
> exactly what's happening.  Are you open to this?
>
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 11:27 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  Oh, sorry Karl: I feel like an idiot.  The cleanup process I run removes
> any outstanding log files for MCF (amongst other things), so those files
> are gone.
>
>
>
> It usually takes a couple of weeks for the agents to get into a state that
> causes these hangs (I think it’s when the database needs vacuuming), so the
> best I can offer is to try again in a couple of weeks.
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 04 February 2015 16:09
>
>
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> It's also possible that there are no log messages from the time of the
> shutdown attempt -- that too would be good to know.  That would mean that
> the Postgresql driver is interpreting interruption with a specific
> SQLException code.  In that case we may need to instrument the code in
> order to determine what the PostgreSQL driver is up to, and it may be
> difficult to distinguish between thread interruption and a legitimate SQL
> error.
>
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 10:53 AM, Karl Wright <[email protected]> wrote:
>
>  Hi Adrian,
>
> There should be log messages corresponding to this line:
>
> >>>>>>
>           // Log it, but keep the thread alive
>           Logging.threads.error("Exception tossed: "+e.getMessage(),e);
> <<<<<<
>
> Can you find any of those, and if so, please send me the logged exception?
>
> Thanks!
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 10:50 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  Thanks Karl, I’m glad I found you an interesting one!
>
>
>
> Reasonably fast agent shutdown would definitely be useful for scheduled DB
> maintenance tasks…
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 04 February 2015 15:38
>
>
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> CONNECTORS-1156.
>
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 10:33 AM, Karl Wright <[email protected]> wrote:
>
>   Hi Adrian,
>
> Clearly, the shutdown is happening at a time when ManifoldCF is in the
> midst of an ANALYZE TABLE operation.  The shutdown is attempting to
> interrupt this operation, and is probably succeeding, but since it is
> happening when a transaction is being closed, it apparently winds up
> retrying the whole thing at that point -- which goes right back into
> ANALYZE, only to be interrupted again one second later.
>
> I still can't see why it retries, but I should be able to analyze that
> much from here.  I'll create a ticket for the problem.
>
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  At that point, only a few (three or so) minutes.
>
>
>
> I left it for another 5 or 6 six minutes after I grabbed the stack trace
> before I finally ‘-9’ed it.
>
>
>
> HTH,
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 04 February 2015 15:21
>
>
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Thanks for the trace.
>
> Here's what I get out of it:
>
> The shutdown thread is waiting for all the threads to terminate:
>
> >>>>>>
> "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in
> Object.wait() [0x00007fe710116000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     at
> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048)
>     - locked <0x00000000e70cf2e8> (a java.lang.Integer)
>     at
> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
>     at
> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
>     at
> org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
>     - locked <0x00000000eafa3460> (a java.util.HashMap)
>     at
> org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
>     at
> org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340)
>     - locked <0x00000000eafcb208> (a java.util.ArrayList)
>     - locked <0x00000000eafcb2a0> (a java.lang.Integer)
>     at
> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565)
> <<<<<<
>
> There are a lot of Zookeeper threads still alive, but those don't matter
> here.  There is precisely one thread that is blocking shutdown:
>
> >>>>>>
> "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in
> Object.wait() [0x00007fe71f7f6000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     at java.lang.Thread.join(Thread.java:1281)
>     - locked <0x00000000e851a0b8> (a
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread)
>     at java.lang.Thread.join(Thread.java:1355)
>     at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694)
>     at
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>     at
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790)
>     at
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>     at
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>     at
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576)
>     at
> org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429)
>     at
> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414)
>     at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231)
>     at
> org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575)
>     at
> org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238)
> <<<<<<
>
> This is performing a database ANALYZE TABLE operation, which *should* be
> interruptible, given the trace, but is apparently *not*.  I'll have to look
> at this to find a reason why it won't interrupt.  Offhand, I can see no
> reason for it.
>
> When you got this thread dump, how long had the system been waiting to
> shut down?
>
> Karl
>
>
>
> On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  Hi Karl,
>
>
>
> It’s taken me a little while before I needed to do some work on the
> server, but here’s a ‘jstack’ dump of an agents run that’s not responding
> to a stop request.
>
>
>
> Hope it’s helpful,
>
>
>
> Adrian
>
>
>
> *From:* Adrian Conlon [mailto:[email protected]]
> *Sent:* 12 January 2015 15:35
> *To:* [email protected]
> *Subject:* RE: Error: Unexpected jobqueue status
>
>
>
> Thanks Karl.  Restarting the job manually did fix the problem.
>
>
>
> I might add a check for this in my software and kick the job into life
> again automatically, now I know it works…
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected] <[email protected]>]
> *Sent:* 12 January 2015 11:34
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Hi Adrian,
>
> Just restarting the job should be sufficient to get it sorted out after
> this kind of failure.
>
> Karl
>
>
>
> On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <[email protected]>
> wrote:
>
>  Thanks Karl.
>
>
>
> With regards the runtime environment, apologies for the omission of
> Postgresql version.  It’s v9.3.
>
>
>
> For the stack trace, I’ve just installed a jdk on the problematic server
> and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the
> next time agents process doesn’t respond to a stop request.
>
>
>
> With regards jobs that have this unexpected “jobqueue” status; do they
> sort themselves out the next time the job runs?  Is there anything I should
> do to “help” the job along?
>
>
>
> Adrian
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* 12 January 2015 00:27
> *To:* [email protected]
> *Subject:* Re: Error: Unexpected jobqueue status
>
>
>
> Also, if you are having trouble shutting down the agents process, it would
> be great if you could get a thread dump and post it, before you kill -9 it.
>
> Karl
>
>
>
> On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <[email protected]> wrote:
>
>  Hi Adrian,
>
> If you noted the comment stream in CONNECTORS-590, I was able to
> demonstrate conclusively that the problem was in Postgres.  I have not seen
> the problem in 9.3, but that does not mean it's gone.  What version of
> Postgresql are you using?
>
> In any case, while this problem definitely terminates your job, it will
> not happen very often.  I suspect the frequency of occurrence may depend on
> how loaded the database is.
>
> Karl
>
>
>
> On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <[email protected]>
> wrote:
>
>  Hi All,
>
>
>
> I’m getting an occurrence of what looks very similar to CONNECTORS-590.
>
>
>
> The circumstances are:
>
>
>
> 1)      MCF Jobs proceeding very slowly (looks like a Postgresql vacuum
> is needed)
>
> 2)      Stop tomcat
>
> 3)      Attempt to stop the agents normally
>
> 4)      Wait a minute or two
>
> 5)      Decide to “kill -9” the agents process
>
> 6)      Vacuum the database
>
> 7)      Restart tomcat
>
> 8)      Restart the agents
>
>
>
> When I checked the job status page, I found that two of the jobs (out
> around 4000 or so) had the following status (or very similar):
>
>
>
> Error: Unexpected jobqueue status - record id 1417115392831, expecting
> active status, saw 4
>
>
>
> Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2),
> using postresql as the crawl database and running on Ubuntu Linux.  I’m
> using zookeeper style synchronisation.
>
>
>
> Let me know if more information etc. is needed or if you think it’s a
> new/real issue.
>
> Adrian
>
> ____________________________________________________________
> Electronic mail messages entering and leaving Arup  business
> systems are scanned for acceptability of content and viruses
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

Re: Error: Unexpected jobqueue status

Reply via email to