Mike,

One more thing... can you please grab a couple more thread dumps for us, with 5 to 10 minutes between them?
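A small loop like the following can collect the dumps unattended. This is only a sketch, not NiFi tooling: it assumes the JDK's `jstack` is on the PATH, and the `capture_dumps` helper name and output file naming are made up here; pass the NiFi JVM's pid.

```shell
# Sketch: capture COUNT thread dumps, INTERVAL seconds apart.
# Assumes the JDK's `jstack` is on the PATH; the first argument is the
# NiFi JVM's pid (e.g. from `cat nifi.pid` or `jps`).
capture_dumps() {
  pid="$1"
  count="${2:-3}"
  interval="${3:-300}"   # 5 minutes between dumps by default
  i=1
  while [ "$i" -le "$count" ]; do
    ts=$(date +%Y%m%d-%H%M%S)
    # Index the file name as well as timestamping it, so rapid runs
    # never overwrite an earlier dump.
    jstack "$pid" > "threaddump-${ts}-${i}.txt"
    echo "wrote threaddump-${ts}-${i}.txt"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `capture_dumps <nifi-pid> 3 300` would write three dumps five minutes apart.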
I don't see a deadlock, but I do suspect either just crazy slow IO going on or a possible livelock. The thread dump will help narrow that down a bit. Can you also run 'iostat -xmh 20' for a bit (or its equivalent) on the system, please?

Thanks
Joe

On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <[email protected]> wrote:
> Mike,
>
> No need for more info. Heap/GC looks beautiful.
>
> The thread dump, however, shows some problems. The provenance
> repository is locked up. Numerous threads are sitting here:
>
>   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>
> This means these are processors committing their sessions and updating
> provenance, but they're waiting on a read lock to provenance. That lock
> cannot be obtained because a provenance maintenance thread is
> attempting to purge old events and cannot.
>
> I recall us having addressed this, so I am looking to see when that
> was fixed. If provenance is not critical for you right now, you can
> swap out the persistent implementation for the volatile provenance
> repository. In nifi.properties, change this line:
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>
> to
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>
> The behavior reminds me of this issue, which was fixed in 1.x:
> https://issues.apache.org/jira/browse/NIFI-2395
>
> Need to dig into this more...
>
> Thanks
> Joe
>
> On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <[email protected]> wrote:
>> Hi Joe,
>>
>> Thank you for your quick response. The system is currently in the
>> deadlocked state with 10 worker threads spinning, so I'll gather the
>> info you requested.
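As a quick check against any dump already collected, one can count how many threads are parked in the `persistRecord` frame Joe highlights above. This is a sketch: the `count_provenance_waiters` name is made up, and the dump file name is a placeholder for whichever dump you saved.

```shell
# Count threads blocked in PersistentProvenanceRepository.persistRecord
# in a captured thread dump. A count near the worker-thread limit is
# consistent with the provenance read-lock pile-up described above.
count_provenance_waiters() {
  grep -c 'PersistentProvenanceRepository.persistRecord' "$1"
}
```

For example, `count_provenance_waiters threaddump-20170216.txt` prints the number of matching stack frames.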
>>
>> - The available space on the partition is 223G free of 500G (same as
>>   was available for 0.6.1)
>> - java.arg.3=-Xmx4096m in bootstrap.conf
>> - The thread dump and jstat output are here:
>>   https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>>
>> Unfortunately, it's hard to predict when the decay starts, and it
>> takes too long to monitor the system manually. However, if you still
>> need thread dumps taken while it decays after seeing the attached
>> dumps, I can set up a timer script.
>>
>> Let me know if you need any more info.
>>
>> Thanks,
>> Mike.
>>
>> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <[email protected]> wrote:
>>>
>>> Mike,
>>>
>>> Can you capture a series of thread dumps as the gradual decay occurs,
>>> noting when each was generated and specifically calling out the "now
>>> the system is doing nothing" point? Can you check the space available
>>> on the system during these times as well? Also, please advise on the
>>> behavior of the heap/garbage collection. Often (though not always) a
>>> gradual decay in performance can suggest an issue with GC, as you
>>> know. Can you run something like
>>>
>>> jstat -gcutil -h5 <pid> 1000
>>>
>>> and capture those results in chunks as well?
>>>
>>> This would give us a pretty good picture of the health of the system
>>> and JVM around those times. The info is probably too much for the
>>> mailing list, so feel free to create a JIRA for this and put the
>>> attachments there, or link to gists on GitHub, etc.
>>>
>>> Pretty confident we can get to the bottom of what you're seeing quickly.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <[email protected]> wrote:
>>> > Hello,
>>> >
>>> > Recently we upgraded from 0.6.1 to 1.1.1, and at first everything
>>> > was working well. However, a few hours later none of the processors
>>> > were showing any activity.
>>> > Then I tried restarting NiFi, which caused some flowfiles to get
>>> > corrupted, as evidenced by exceptions thrown in nifi-app.log;
>>> > however, the processors still produced no activity. Next, I stopped
>>> > the service and deleted all state (content_repository,
>>> > database_repository, flowfile_repository, provenance_repository,
>>> > work). Then the processors started working again for a few hours
>>> > (maybe a day) until the deadlock occurred again.
>>> >
>>> > So the cycle continues: I have to periodically reset the service
>>> > and delete the state to get things moving. Obviously, that's not
>>> > great. I'll note that the flow.xml file has been changed by the new
>>> > version of NiFi as I added/removed processors, but 95% of the flow
>>> > configuration is the same as before the upgrade. So I'm wondering
>>> > if there is a configuration setting that causes these deadlocks.
>>> >
>>> > What I've been able to observe is that the deadlock is "gradual":
>>> > my flow usually takes about 4-5 threads to execute. The deadlock
>>> > causes the worker threads to max out at the limit, and I'm not even
>>> > able to stop any processors or list queues. I also have not seen
>>> > this behavior in a fresh install of NiFi where the flow.xml started
>>> > out empty.
>>> >
>>> > Can you give me some advice on what to do about this? Would the
>>> > problem be resolved if I manually rebuilt the flow with the new
>>> > version of NiFi (not looking forward to that)?
>>> >
>>> > Much appreciated.
>>> >
>>> > Mike.
>>> >
>>> > This email may contain material that is confidential for the sole
>>> > use of the intended recipient(s). Any review, reliance or
>>> > distribution or disclosure by others without express permission is
>>> > strictly prohibited. If you are not the intended recipient, please
>>> > contact the sender and delete all copies of this message.
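For completeness, the `jstat` capture Joe suggested earlier in the thread can be wrapped so each sample carries a timestamp column (`-t`) and can be lined up with the thread dumps afterward. This is a sketch: the `capture_gc` name is made up, and it assumes the JDK's `jstat` is on the PATH.

```shell
# Sketch: sample GC utilization once per second, repeating the header
# every 5 rows (-h5) and prefixing each sample with a timestamp (-t)
# so it can be correlated with the thread dumps.
capture_gc() {
  pid="$1"
  samples="${2:-60}"   # one sample per second for a minute by default
  jstat -gcutil -t -h5 "$pid" 1000 "$samples"
}
```

For example, `capture_gc <nifi-pid> 60 > jstat-gcutil.log` collects a minute of samples into a log file suitable for attaching to a JIRA.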
