Here's an explanation of what is supposed to happen in this case in
PostgreSQL; see slide 7 of
https://www.postgresql.org/files/developer/concurrency.pdf
The explanation should apply to any MVCC database as well. The sequence
we're seeing is (sketched in code just below):

Thread 1 (stuffer):
  <start transaction> <do_expensive_stuffer_query> <update_jobqueue_status>
  <end transaction> <hand records to worker threads>

Worker thread:
  <process document> <begin transaction>
  <change jobqueue status from active to completed> <end transaction>
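A minimal sketch of that handoff in JDBC-style Java may help. Everything
here is hypothetical -- the table, the status codes 'P'/'A'/'C', and the
method names are illustrations, not ManifoldCF's actual schema or API; the
point is only the ordering, namely that the stuffer's commit happens
strictly before any worker sees the record:

    import java.sql.*;
    import java.util.*;
    import java.util.concurrent.*;

    // Sketch of the stuffer/worker handoff; hypothetical names throughout.
    class HandoffSketch
    {
      static final BlockingQueue<Long> handoff = new LinkedBlockingQueue<Long>();

      // Stuffer thread: claim records in ONE transaction, then hand them off.
      static void stuff(Connection conn) throws Exception
      {
        conn.setAutoCommit(false);
        List<Long> ids = new ArrayList<Long>();
        Statement s = conn.createStatement();
        // The "expensive stuffer query": find records eligible for queuing.
        ResultSet rs = s.executeQuery(
          "SELECT id FROM jobqueue WHERE status = 'P' LIMIT 100");
        while (rs.next())
          ids.add(rs.getLong(1));
        rs.close();
        PreparedStatement u = conn.prepareStatement(
          "UPDATE jobqueue SET status = 'A' WHERE id = ?");  // mark active
        for (Long id : ids)
        {
          u.setLong(1, id);
          u.executeUpdate();
        }
        conn.commit();        // the 'active' status is committed HERE...
        handoff.addAll(ids);  // ...and only then are ids handed to workers
      }

      // Worker thread: process the document, then complete it in its own
      // transaction, which begins strictly after the stuffer's commit.
      static void work(Connection conn) throws Exception
      {
        Long id = handoff.take();
        processDocument(id);          // no database transaction open here
        conn.setAutoCommit(false);
        PreparedStatement c = conn.prepareStatement(
          "UPDATE jobqueue SET status = 'C' WHERE id = ?");  // mark completed
        c.setLong(1, id);
        c.executeUpdate();
        conn.commit();
      }

      static void processDocument(Long id) { /* fetch and index the document */ }
    }

Because the worker's transaction starts after the stuffer's commit, any
MVCC database at read-committed or stronger isolation should show the
worker the committed 'active' status.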
The worker thread, however, is finding that the jobqueue status is not what
it expected. It's as if the transactions didn't actually serialize, once in
a great while, and either the original stuffer update never became visible
by the time the worker thread completed its work, or some other thread
modified the record while the worker thread was processing the document.
But I've got every jobqueue status write instrumented, so if that were what
was happening we'd see it in the diagnostic dump, and we do not.

I wish it were reasonably possible to hand the PostgreSQL folks a test case
that demonstrated this behavior. But, as you know, the problem is quite
rare, and you have to pound on the database for an extended period of time
before you get it to happen. :-(

Karl

On Fri, Jun 7, 2019 at 3:28 AM Karl Wright <[email protected]> wrote:

> I just reviewed the places where JobQueue status is updated to be sure
> that all the places this happens are included in the diagnostics tracking.
> They were, save for when a cluster gets restarted, and I added that to the
> code in trunk as well.
>
> So the analysis, above, stands.
>
> :-(
>
> Karl
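The instrumentation Karl refers to records every JobQueue status write
together with a stack trace, so an unexpected value can be traced back to
whoever wrote it. A hedged sketch of that idea follows; the names here are
hypothetical, and the real TrackerClass in mcf-pull-agent differs in
detail:

    import java.io.*;
    import java.util.*;
    import java.util.concurrent.*;

    // Per-record write tracking; hypothetical names, Java 8 compatible.
    class StatusTracker
    {
      static final ConcurrentHashMap<Long,List<String>> history =
        new ConcurrentHashMap<Long,List<String>>();

      // Call this from EVERY code path that writes a jobqueue status.
      static void noteStatusWrite(long recordId, int newStatus, String reason)
      {
        StringWriter sw = new StringWriter();
        new Exception("status write site").printStackTrace(new PrintWriter(sw, true));
        List<String> l = history.computeIfAbsent(recordId,
          k -> Collections.synchronizedList(new ArrayList<String>()));
        l.add("status=" + newStatus + " (" + reason + ")\n" + sw.toString());
      }

      // Dump everything known about a record when its status looks wrong.
      static void printForensics(long recordId)
      {
        List<String> l = history.get(recordId);
        if (l == null)
          return;
        synchronized (l)
        {
          for (String entry : l)
            System.err.println(entry);
        }
      }
    }

With every writer instrumented, an unexpected status that has no matching
history entry points away from the application code and toward the
database layer -- which is exactly the argument being made above.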
>
> On Thu, Jun 6, 2019 at 4:28 PM Karl Wright <[email protected]> wrote:
>
>> I just looked at this briefly.
>>
>> For
>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204,
>> look for the following record ID:
>>
>> 1522147023170
>>
>> This is the record that, when we look at it after the document is
>> processed, has status 4:
>>
>> >>>>>>
>>
>> manifoldcf_1 | 2019-06-06T02:01:05.380401275Z 2019-06-06T04:01:05,380 DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: ==== Forensics for record 1522147023170, current status: 4 ====
>> manifoldcf_1 | 2019-06-06T02:01:05.380512360Z 2019-06-06T04:01:05,380 DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: === Current stack trace ===
>> manifoldcf_1 | 2019-06-06T02:01:05.380544192Z java.lang.Exception: Forensics stack trace
>> manifoldcf_1 | 2019-06-06T02:01:05.380548494Z   at org.apache.manifoldcf.crawler.jobs.TrackerClass.printForensics(TrackerClass.java:216) [mcf-pull-agent.jar:?]
>> manifoldcf_1 | 2019-06-06T02:01:05.380552564Z   at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1028) [mcf-pull-agent.jar:?]
>> manifoldcf_1 | 2019-06-06T02:01:05.380568210Z   at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3414) [mcf-pull-agent.jar:?]
>>
>> <<<<<<
>>
>> It is expected to have status 5, though, because the only way the
>> document could be worked on at all is if the stuffer thread had put it
>> into status 5:
>>
>> >>>>>>
>>
>> manifoldcf_1 | 2019-06-06T02:01:05.387210733Z 2019-06-06T04:01:05,381 DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: == About to commit transaction by 'Stuffer thread' at 1559786464972 ==
>> manifoldcf_1 | 2019-06-06T02:01:05.387250827Z java.lang.Exception: Precommit stack trace
>> manifoldcf_1 | 2019-06-06T02:01:05.387256464Z   at org.apache.manifoldcf.crawler.jobs.TrackerClass.notePrecommit(TrackerClass.java:108) ~[mcf-pull-agent.jar:?]
>> manifoldcf_1 | 2019-06-06T02:01:05.387260987Z   at org.apache.manifoldcf.crawler.jobs.JobManager.fetchAndProcessDocuments(JobManager.java:3295) ~[mcf-pull-agent.jar:?]
>> manifoldcf_1 | 2019-06-06T02:01:05.387265034Z   at org.apache.manifoldcf.crawler.jobs.JobManager.getNextDocuments(JobManager.java:3014) ~[mcf-pull-agent.jar:?]
>> manifoldcf_1 | 2019-06-06T02:01:05.387269073Z   at org.apache.manifoldcf.crawler.system.StufferThread.run(StufferThread.java:186) ~[mcf-pull-agent.jar:?]
>> manifoldcf_1 | 2019-06-06T02:01:05.387285663Z 2019-06-06T04:01:05,387 DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: Transaction includes:
>> manifoldcf_1 | 2019-06-06T02:01:05.387290117Z 2019-06-06T04:01:05,387 DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: Record 1522147023170 status modified to 5: Make active
>>
>> ...
>>
>> <<<<<<
>>
>> The actual commit that sets the status follows a short while later, and
>> only AFTER that are the documents queued to be handed to the worker
>> threads.
>>
>> So what this is saying is that the stuffer thread sets the document
>> status to 5, a worker thread picks the document up, and when the worker
>> thread is done it tries to set the status back to 4 -- but the status is
>> ALREADY 4, which should not be possible.
>>
>> Now, it may be worth examining the possibility that some other job gets
>> to the document while it is waiting to be queued and does things to the
>> document's jobqueue record. There are a couple of background threads
>> which (for example) reprioritize documents and might conceivably change
>> the jobqueue record's status. They generally work by changing the record
>> status, in a transaction, to some other value that would make the record
>> ineligible for queuing, and then changing it back when done. But unless
>> something was completely overlooked, these writes would be visible in
>> the diagnostics, and furthermore it should not be possible for one
>> thread in a transaction to change the status at the same time as another
>> thread in a transaction changes it.
>>
>> So either way I'm forced to conclude that a record gets changed and
>> committed, but the change is not visible to another thread that later
>> tries to reset the status.
>>
>> Karl
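For context, the check that emits "Unexpected jobqueue status" is the
guarded-transition pattern sketched below: the completing transaction
re-reads the record's status and refuses to proceed if it is not the
expected 'active' value. This is a hedged illustration, not the actual
JobQueue.updateCompletedRecord code; the numeric codes follow the log
above (5 apparently meaning active, 4 being the value actually seen), and
the schema and names are hypothetical:

    import java.sql.*;

    // Guarded status transition; hypothetical names and assumed codes.
    class CompletionSketch
    {
      static final int STATUS_ACTIVE = 5;    // per "Make active" in the log
      static final int STATUS_COMPLETE = 4;  // assumed meaning of code 4

      static void markCompleted(Connection conn, long id) throws Exception
      {
        conn.setAutoCommit(false);
        try
        {
          PreparedStatement q = conn.prepareStatement(
            "SELECT status FROM jobqueue WHERE id = ? FOR UPDATE");
          q.setLong(1, id);
          ResultSet rs = q.executeQuery();
          if (!rs.next())
            throw new Exception("Record not found: " + id);
          int status = rs.getInt(1);
          rs.close();
          // The guard that fires in the reports: the stuffer committed 5,
          // yet this read sees 4.
          if (status != STATUS_ACTIVE)
            throw new Exception("Unexpected jobqueue status - record id "
              + id + ", expecting active status, saw " + status);
          PreparedStatement u = conn.prepareStatement(
            "UPDATE jobqueue SET status = ? WHERE id = ?");
          u.setInt(1, STATUS_COMPLETE);
          u.setLong(2, id);
          u.executeUpdate();
          conn.commit();
        }
        catch (Exception e)
        {
          conn.rollback();
          throw e;
        }
      }
    }

If commits really become visible to later transactions, the SELECT here
can only ever observe 5 for a record that was handed off after the
stuffer's commit; seeing 4 with no intervening instrumented write is what
makes the reports so puzzling.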
>>
>> On Thu, Jun 6, 2019 at 5:58 AM Karl Wright <[email protected]> wrote:
>>
>>> I can look at the log output, but not until this weekend.
>>>
>>> Karl
>>>
>>> On Thu, Jun 6, 2019 at 3:59 AM Markus Schuch <[email protected]> wrote:
>>>
>>>> Hi Olivier,
>>>>
>>>> we were not able to fix this yet.
>>>>
>>>> But now we have new diagnostics log data, after the error occurred
>>>> again yesterday:
>>>>
>>>> Unexpected jobqueue status - record id 1522147023170, expecting active status, saw 4
>>>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204
>>>>
>>>> Unexpected jobqueue status - record id 1541529864711, expecting active status, saw 4
>>>> https://gist.githubusercontent.com/schuch/d73672e44b0edd5a9184170c9ee417a5/raw/f55b1c023056ef8e310d99489a0b6f9dfcaaa585/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201541529864711,%2520expecting%2520active%2520status,%2520saw%25204
>>>>
>>>> @Karl, maybe you can get some ideas from that log data?
>>>>
>>>> Cheers,
>>>> Markus
>>>>
>>>> From: Olivier Tavard <[email protected]>
>>>> Sent: Tuesday, 21 May 2019 11:00
>>>> To: [email protected]
>>>> Subject: Re: Error: Unexpected jobqueue status - record id X,
>>>> expecting active status, saw 4 (MySQL compatible Database)
>>>>
>>>> Hi Markus,
>>>>
>>>> We have the same error (with a PostgreSQL database). Did the error
>>>> occur again since your last mail? Did you change anything in your MCF
>>>> configuration to fix this?
>>>>
>>>> Thanks,
>>>> Best regards,
>>>>
>>>> Olivier
>>>>
>>>> On 13 Feb 2019, at 13:58, Markus Schuch <[email protected]> wrote:
>>>>
>>>> Hi Karl,
>>>>
>>>> we set the diagnostics logger to level debug.
>>>>
>>>> I will get back when the error occurs again.
>>>>
>>>> Cheers,
>>>> Markus
>>>>
>>>> On 12 Feb 2019, at 17:41, Karl Wright <[email protected]> wrote:
>>>>
>>>> Hi Markus,
>>>>
>>>> There's a properties.xml debugging logger you can enable that will
>>>> keep track of what's happening with transactions, so that when an
>>>> error of this kind is reported, information about why the situation is
>>>> unexpected is dumped to the log. The logger is called "diagnostics",
>>>> i.e. "org.apache.manifoldcf.diagnostics".
>>>>
>>>> Karl
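For reference, enabling that logger means adding a property to
properties.xml. A hedged example of what the entry might look like,
following the pattern of ManifoldCF's other logging switches (verify the
exact form against the documentation for your version):

    <!-- properties.xml: turn on the transaction diagnostics logger -->
    <configuration>
      <!-- ... existing connection and logging properties ... -->
      <property name="org.apache.manifoldcf.diagnostics" value="DEBUG"/>
    </configuration>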
>>>>
>>>> On Tue, Feb 12, 2019 at 10:53 AM Markus Schuch <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> we are seeing "Error: Unexpected jobqueue status - record id
>>>> 1484612513829, expecting active status, saw 4" from time to time.
>>>> I didn't report it earlier because we were running on an old MCF
>>>> version, and there were some bug reports relating to this error that
>>>> are resolved in newer versions.
>>>>
>>>> Now we have upgraded to the latest and greatest and the error still
>>>> occurs, so I want to start tracking this down.
>>>>
>>>> Our setup is:
>>>>
>>>> - ManifoldCF 2.12, running in a Docker container based on Red Hat
>>>>   Linux, OpenJDK 8
>>>> - AWS RDS database (Aurora MySQL -> 5.6 compatible)
>>>> - single-process setup
>>>>
>>>> We run several jobs ("run once" mode) overnight, with schedule windows
>>>> and max-runtime settings. Some of the time windows overlap. On normal
>>>> days some jobs finish during their time window; others go to WAITING
>>>> and continue the next night.
>>>>
>>>> The error mostly hits a SharePoint repository job. We have the
>>>> impression that the error is somehow related to situations where
>>>> service interruptions (e.g. connection issues) occur in other jobs.
>>>>
>>>> Maybe all this is related to fundamental problems with the database or
>>>> the JDBC driver (Karl expressed his concerns about that in the possibly
>>>> related https://issues.apache.org/jira/browse/CONNECTORS-1581 ticket).
>>>> On the other hand, there were similar bug reports about unexpected
>>>> jobqueue status in the past, and they were resolved. Maybe there is a
>>>> chance this can also be analyzed and fixed, but I do not really know
>>>> where to start.
>>>>
>>>> Does anybody have ideas how we can track this down / debug this to get
>>>> to the bottom of the problem? What could be the first step in the
>>>> analysis?
>>>>
>>>> Many thanks in advance,
>>>> Markus
>>>>
>>>> Possibly related bug reports:
>>>>
>>>> [0] https://issues.apache.org/jira/browse/CONNECTORS-1395
>>>> [1] https://issues.apache.org/jira/browse/CONNECTORS-1180
>>>> [2] https://issues.apache.org/jira/browse/CONNECTORS-590
>>>> [3] https://issues.apache.org/jira/browse/CONNECTORS-246
>>>>
>>>> --
>>>> Markus Schuch
>>>> Web Business (T.IPB 26)
>>>>
>>>> DB Systel GmbH
>>>> Jürgen-Ponto-Platz 1, 60329 Frankfurt a. Main
