Hi Guy, The agents thread dump shows that there's a lock stuck from somewhere; I expect it's from the UI. Next time this happens, could you get a thread dump for the UI process as well as from the agents process? Thanks!!
Karl On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <[email protected]> wrote: > Hi Guy, > > See https://issues.apache.org/jira/browse/CONNECTORS-590. > > When you see "unexpected" this is not a good sign: > > >>>>>> > > “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: > Unexpected jobqueue status - record id 1488898668325, expecting active > status, saw 4 > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected > jobqueue status - record id 1488898668325, expecting active status, saw 4 > > at org.apache.manifoldcf.crawler. > jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019) > > at org.apache.manifoldcf.crawler. > jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271) > > at org.apache.manifoldcf.crawler. > system.WorkerThread.run(WorkerThread.java:710) > > <<<<<< > > > I've spent weeks chasing this issue in the past. I have not seen it at > all except in certain installations: 2 of them, counting yours. I > introduced transaction forensics into the MCF code to figure out what was > going on and I definitively proved (on Postgresql 9.1) that we were seeing > a transactional integrity problem with Postgresql itself. A ticket against > Postgresql was not logged because they'd need a reproducible test case and > also the latest version, and it didn't happen on 9.3 at the time. > > You also seem to be seeing a deadlock in MCF locks as well. When the > Postgresql bug happens it means that the database is in a state that MCF > really can't figure out and thus doesn't know how to deal with, so this may > just be a downstream result of that occurrence. But I can't be sure > without further analysis. > > I'm very curious now about the details of your setup. (1) How is your > postgresql set up? What version did you decide to use? (2) How many > agents processes do you have? Just one? (3) What OS? (4) What JDK? > > Other things you can try: > (1) Running the postgresql LT tests (ant run-LT-postgresql) against your > postgresql installation; you will need to change the test code itself to > allow it to create an instance for testing in that case; > (2) Turning on database transaction forensics (property name > "org.apache.manifoldcf.diagnostics" value "DEBUG") > (3) Turning on lock debugging (property name "org.apache.manifoldcf.lock" > value "DEBUG") > > Transaction forensics uses memory but only generates lots of output when > something goes wrong, so that may be the best place to start. Turning on > lock debugging will generate a large amount of log output but may be > necessary, especially if we want to figure out if the lock issue is a > downstream problem caused by the database integrity issue. > > > Thanks, > Karl > > > On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <[email protected]> > wrote: > >> Hi Karl, >> >> I have upgraded to MCF 2.6 on Windows using Zookeeper for >> synchronisation and PostgreSQL 9.3.16 for the database. >> >> I re-ran the pair of jobs ( 1 web crawl of local intranet site every 15 >> minutes and 1 crawl of JDBC connected database every 3 minutes) that >> would previously lock up after a couple of iterations. These seemed to >> run much better together but after 6 hours running there was a lock up for >> both these jobs (both indicating they were starting up). >> >> >> >> Attempting to restart or abort the jobs through the MCF webapp failed. >> >> >> >> After restarting the MCF agents process all sprang into life and the jobs >> aborted/restarted as expected. >> >> >> >> In the MCF log the following error was found >> >> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: >> Unexpected jobqueue status - record id 1488898668325, expecting active >> status, saw 4 >> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected >> jobqueue status - record id 1488898668325, expecting active status, saw 4 >> >> at org.apache.manifoldcf.crawler. >> jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019) >> >> at org.apache.manifoldcf.crawler. >> jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271) >> >> at org.apache.manifoldcf.crawler. >> system.WorkerThread.run(WorkerThread.java:710) >> >> WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption >> reported for job 1488898090224 connection 'web': Job no longer active >> >> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't >> release lock we don't hold >> >> java.lang.IllegalStateException: Can't release lock we don't hold” >> >> >> >> >> >> ManifoldCF does not seem to have recovered from this error and the jobs >> can’t be re-started without restarting ManifoldCF agents. >> >> >> >> The manifoldcf, PostgreSQL and Agents JVM stack trace are attached >> >> >> >> Best Regards, >> >> >> >> Guy >> >> >> >> *From:* Karl Wright [mailto:[email protected]] >> *Sent:* 06 March 2017 09:24 >> *To:* [email protected] >> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 >> >> >> >> Hi Guy, >> >> >> >> (1) I have no experience with PostgresSQL versions beyond 9.3, but I >> doubt you would have problems. >> >> (2) If you are using multiple processes, even if there's only one agents >> process, you must use synchronization. I would recommend Zookeeper; >> file-system-based synchronization is deprecated. >> >> (3) Windows has many ways of interfering with file-based sync, including >> path-length issues. I have seen Windows fail to unlock files and need a >> reboot to release the lock. This is one reason why file-system-based >> locking is deprecated. >> >> >> >> Thanks, >> Karl >> >> >> >> >> >> On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <[email protected]> >> wrote: >> >> Hi Karl, >> >> Thanks for that I will try version 2.6. Whilst moving to >> MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version >> from 9.3.5. >> >> 1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6 >> e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2? >> >> 2) For a production system on a single server running a single MCF agents >> process would you recommend the file based synchronisation locking or >> zookeeper based synchronisation locking. With the file based >> synchronisation locking mechanism I have sometimes seen errors of the form : >> >> 'D:\Apps\ManifoldCF\apache-man >> ifoldcf-2.0.1\multiprocess-file-example\.\.\syncharea\475\ >> 708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr COLL1 osp_unstruct.lock' >> failed: Access is denied’ ( I have ensured that the SYNCHAREA is not >> scanned by AV or Indexed by Windows Search and all MCF processes run as >> the same user) >> >> What could cause these errors? >> >> >> >> Many Thanks, >> >> >> >> Guy >> >> >> >> *From:* Karl Wright [mailto:[email protected]] >> *Sent:* 03 March 2017 17:35 >> >> >> *To:* [email protected] >> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 >> >> >> >> Hi Guy: >> >> >> >> It is expected that sometimes database deadlock will develop, and the >> transaction will need to be retried. There is code in MCF for doing this: >> >> >> >> >>>>>> >> >> if (sqlState != null && sqlState.equals("40001")) >> >> return new ManifoldCFException(message,e, >> ManifoldCFException.DATABASE_TRANSACTION_ABORT); >> >> <<<<<< >> >> >> >> I suspect that your version of MCF is old enough so that this particular >> error and the associated retry are not taking place. Upgrading to 2.6 will >> definitely help there. >> >> >> >> Karl >> >> >> >> >> >> On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <[email protected]> >> wrote: >> >> Hi Karl, >> >> Thanks for coming back so quickly. Unfortunately I wasn’t >> using a JCIFS connection. One of the issues I was seeing was between a >> crawl of an intranet site (no explicit throttling other than number of >> connections) and scheduled crawl (every 5 mins) to a relational DB via >> JDBC connector again no explicit throttling. To simplify things both jobs >> are using a NULL output connection. Sometimes both the Web crawl and the >> JDBC connection can run together but at other times 1 or both jobs will >> appear to lock up with just a few active documents showing. When I get a >> lock up the mcf log contains errors like: >> >> >> >> “DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database >> exception: SQLException doing query (40001): ERROR: could not serialize >> access due to read/write dependencies among transactions” >> >> >> >> See the attached log extract for a little more detail. >> >> >> >> Any view why this might be happening? >> >> >> >> Best Regards, >> >> >> >> Guy >> >> >> >> >> >> >> >> >> >> >> >> >> >> *From:* Karl Wright [mailto:[email protected]] >> *Sent:* 03 March 2017 11:27 >> *To:* [email protected] >> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 >> >> >> >> Hi Guy, >> >> >> >> A issue with concurrent jobs is known for jobs sharing the same JCIFS >> connection. Is that what you are using? This has nothing to do with the >> version of Postgresql you are using; it has to do with what "bins" >> documents are thought to come from. There has been a recent improvement >> for this issue, which will be released in April. See >> https://issues.apache.org/jira/browse/CONNECTORS-1364. >> >> >> >> The current version of MCF (2.6) supports Solr 6.x. >> >> >> >> Thanks, >> >> Karl >> >> >> >> >> >> On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <[email protected]> >> wrote: >> >> Hi Karl, >> >> I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have >> had some issues with multiple jobs running concurrently. >> >> I am considering upgrading to MCF 2.6 and to a newer version of >> PostgreSQL. Would you be able to advise which version of PostgreSQL I >> should consider using with MCF 2.6 (e.g. PostgreSQL 9.3.16 or 9.6.2) >> >> >> >> I am also considering upgrading from SOLR 4.10.3 to a newer version. The >> MCF compatibility matrix mentions that compatibility has been tested to >> SOLR version 4.5.1. Do you have any advice about compatibility with the >> newer versions of SOLR e.g. 6.4.1. >> >> >> >> Best Regards >> >> >> >> Guy >> >> >> >> >> Unless otherwise stated, this email has been sent from Fujitsu Services >> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in >> England No 2216100) both with registered offices at: 22 Baker Street, >> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and >> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) >> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, >> Middlesex, UB4 8FE. >> This email is only for the use of its intended recipient. Its contents >> are subject to a duty of confidence and may be privileged. Fujitsu does not >> guarantee that this email has not been intercepted and amended or that it >> is virus-free. >> >> >> >> >> Unless otherwise stated, this email has been sent from Fujitsu Services >> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in >> England No 2216100) both with registered offices at: 22 Baker Street, >> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and >> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) >> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, >> Middlesex, UB4 8FE. >> This email is only for the use of its intended recipient. Its contents >> are subject to a duty of confidence and may be privileged. Fujitsu does not >> guarantee that this email has not been intercepted and amended or that it >> is virus-free. >> >> >> >> >> Unless otherwise stated, this email has been sent from Fujitsu Services >> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in >> England No 2216100) both with registered offices at: 22 Baker Street, >> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and >> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) >> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, >> Middlesex, UB4 8FE. >> This email is only for the use of its intended recipient. Its contents >> are subject to a duty of confidence and may be privileged. Fujitsu does not >> guarantee that this email has not been intercepted and amended or that it >> is virus-free. >> >> >> >> Unless otherwise stated, this email has been sent from Fujitsu Services >> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in >> England No 2216100) both with registered offices at: 22 Baker Street, >> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and >> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) >> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, >> Middlesex, UB4 8FE. >> This email is only for the use of its intended recipient. Its contents >> are subject to a duty of confidence and may be privileged. Fujitsu does not >> guarantee that this email has not been intercepted and amended or that it >> is virus-free. >> > >
