Hi Adrian, The patch is in mcf-core, so that should be fine.
Karl On Thu, Feb 5, 2015 at 12:42 PM, Adrian Conlon <[email protected]> wrote: > Hi Karl, > > > > I’ve applied the logging patch from #1156 and uploaded onto my running > server. > > > > I presume I need to wait until the problem would recur before the effects > of the patch would be seen. Is that correct? > > > > Also, for ease of deployment (I was just being lazy really!), I only > uploaded mcf-core.jar and mcf-agents.jar. To confirm, that would include > your patch, wouldn’t it? > > > > Thanks, > > > > Adrian > > > > *From:* Adrian Conlon [mailto:[email protected]] > *Sent:* 04 February 2015 17:24 > > *To:* [email protected] > *Subject:* RE: Error: Unexpected jobqueue status > > > > Thanks Karl, > > > > I’ll build/deploy tomorrow and keep you posted. > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected] <[email protected]>] > *Sent:* 04 February 2015 17:14 > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Hi Adrian, > > I've attached the patch to the CONNECTORS-1156 ticket. > > Thanks! > Karl > > > > On Wed, Feb 4, 2015 at 12:05 PM, Adrian Conlon <[email protected]> > wrote: > > I’ve built MCF from source in the past with patches applied, so I’m > happy to do so again. > > > > Let me know when you’ve got a patch for me to try… > > > > Thanks, > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 04 February 2015 16:59 > > > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Hi Adrian, > > That's fine. However, if you have a version of MCF you've built yourself, > maybe we could add instrumentation to the code which would help resolve > exactly what's happening. Are you open to this? > > Karl > > > > On Wed, Feb 4, 2015 at 11:27 AM, Adrian Conlon <[email protected]> > wrote: > > Oh, sorry Karl: I feel like an idiot. The cleanup process I run removes > any outstanding log files for MCF (amongst other things), so those files > are gone. > > > > It usually takes a couple of weeks for the agents to get into a state that > causes these hangs (I think it’s when the database needs vacuuming), so the > best I can offer is to try again in a couple of weeks. > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 04 February 2015 16:09 > > > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > It's also possible that there are no log messages from the time of the > shutdown attempt -- that too would be good to know. That would mean that > the Postgresql driver is interpreting interruption with a specific > SQLException code. In that case we may need to instrument the code in > order to determine what the PostgreSQL driver is up to, and it may be > difficult to distinguish between thread interruption and a legitimate SQL > error. > > Karl > > > > On Wed, Feb 4, 2015 at 10:53 AM, Karl Wright <[email protected]> wrote: > > Hi Adrian, > > There should be log messages corresponding to this line: > > >>>>>> > // Log it, but keep the thread alive > Logging.threads.error("Exception tossed: "+e.getMessage(),e); > <<<<<< > > Can you find any of those, and if so, please send me the logged exception? > > Thanks! > Karl > > > > On Wed, Feb 4, 2015 at 10:50 AM, Adrian Conlon <[email protected]> > wrote: > > Thanks Karl, I’m glad I found you an interesting one! > > > > Reasonably fast agent shutdown would definitely be useful for scheduled DB > maintenance tasks… > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 04 February 2015 15:38 > > > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > CONNECTORS-1156. > > Karl > > > > On Wed, Feb 4, 2015 at 10:33 AM, Karl Wright <[email protected]> wrote: > > Hi Adrian, > > Clearly, the shutdown is happening at a time when ManifoldCF is in the > midst of an ANALYZE TABLE operation. The shutdown is attempting to > interrupt this operation, and is probably succeeding, but since it is > happening when a transaction is being closed, it apparently winds up > retrying the whole thing at that point -- which goes right back into > ANALYZE, only to be interrupted again one second later. > > I still can't see why it retries, but I should be able to analyze that > much from here. I'll create a ticket for the problem. > > Karl > > > > On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <[email protected]> > wrote: > > At that point, only a few (three or so) minutes. > > > > I left it for another 5 or 6 six minutes after I grabbed the stack trace > before I finally ‘-9’ed it. > > > > HTH, > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 04 February 2015 15:21 > > > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Thanks for the trace. > > Here's what I get out of it: > > The shutdown thread is waiting for all the threads to terminate: > > >>>>>> > "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in > Object.wait() [0x00007fe710116000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048) > - locked <0x00000000e70cf2e8> (a java.lang.Integer) > at > org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617) > at > org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249) > at > org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168) > - locked <0x00000000eafa3460> (a java.util.HashMap) > at > org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395) > at > org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340) > - locked <0x00000000eafcb208> (a java.util.ArrayList) > - locked <0x00000000eafcb2a0> (a java.lang.Integer) > at > org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565) > <<<<<< > > There are a lot of Zookeeper threads still alive, but those don't matter > here. There is precisely one thread that is blocking shutdown: > > >>>>>> > "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in > Object.wait() [0x00007fe71f7f6000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x00000000e851a0b8> (a > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread) > at java.lang.Thread.join(Thread.java:1355) > at > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694) > at > org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728) > at > org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790) > at > org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444) > at > org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146) > at > org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576) > at > org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429) > at > org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414) > at > org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231) > at > org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575) > at > org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238) > <<<<<< > > This is performing a database ANALYZE TABLE operation, which *should* be > interruptible, given the trace, but is apparently *not*. I'll have to look > at this to find a reason why it won't interrupt. Offhand, I can see no > reason for it. > > When you got this thread dump, how long had the system been waiting to > shut down? > > Karl > > > > On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <[email protected]> > wrote: > > Hi Karl, > > > > It’s taken me a little while before I needed to do some work on the > server, but here’s a ‘jstack’ dump of an agents run that’s not responding > to a stop request. > > > > Hope it’s helpful, > > > > Adrian > > > > *From:* Adrian Conlon [mailto:[email protected]] > *Sent:* 12 January 2015 15:35 > *To:* [email protected] > *Subject:* RE: Error: Unexpected jobqueue status > > > > Thanks Karl. Restarting the job manually did fix the problem. > > > > I might add a check for this in my software and kick the job into life > again automatically, now I know it works… > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected] <[email protected]>] > *Sent:* 12 January 2015 11:34 > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Hi Adrian, > > Just restarting the job should be sufficient to get it sorted out after > this kind of failure. > > Karl > > > > On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <[email protected]> > wrote: > > Thanks Karl. > > > > With regards the runtime environment, apologies for the omission of > Postgresql version. It’s v9.3. > > > > For the stack trace, I’ve just installed a jdk on the problematic server > and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the > next time agents process doesn’t respond to a stop request. > > > > With regards jobs that have this unexpected “jobqueue” status; do they > sort themselves out the next time the job runs? Is there anything I should > do to “help” the job along? > > > > Adrian > > > > *From:* Karl Wright [mailto:[email protected]] > *Sent:* 12 January 2015 00:27 > *To:* [email protected] > *Subject:* Re: Error: Unexpected jobqueue status > > > > Also, if you are having trouble shutting down the agents process, it would > be great if you could get a thread dump and post it, before you kill -9 it. > > Karl > > > > On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <[email protected]> wrote: > > Hi Adrian, > > If you noted the comment stream in CONNECTORS-590, I was able to > demonstrate conclusively that the problem was in Postgres. I have not seen > the problem in 9.3, but that does not mean it's gone. What version of > Postgresql are you using? > > In any case, while this problem definitely terminates your job, it will > not happen very often. I suspect the frequency of occurrence may depend on > how loaded the database is. > > Karl > > > > On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <[email protected]> > wrote: > > Hi All, > > > > I’m getting an occurrence of what looks very similar to CONNECTORS-590. > > > > The circumstances are: > > > > 1) MCF Jobs proceeding very slowly (looks like a Postgresql vacuum > is needed) > > 2) Stop tomcat > > 3) Attempt to stop the agents normally > > 4) Wait a minute or two > > 5) Decide to “kill -9” the agents process > > 6) Vacuum the database > > 7) Restart tomcat > > 8) Restart the agents > > > > When I checked the job status page, I found that two of the jobs (out > around 4000 or so) had the following status (or very similar): > > > > Error: Unexpected jobqueue status - record id 1417115392831, expecting > active status, saw 4 > > > > Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2), > using postresql as the crawl database and running on Ubuntu Linux. I’m > using zookeeper style synchronisation. > > > > Let me know if more information etc. is needed or if you think it’s a > new/real issue. > > Adrian > > ____________________________________________________________ > Electronic mail messages entering and leaving Arup business > systems are scanned for acceptability of content and viruses > > > > > > > > > > > > > > > > > > > > >
