Mike,

One more thing... can you please grab a couple more thread dumps for us, with 5 to 10 minutes between them?
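A small loop like the following can collect the dumps unattended. This is only a sketch, not NiFi tooling: it assumes the JDK's `jstack` is on the PATH, and the `capture_dumps` helper name and output file naming are made up here; pass the NiFi JVM's pid.

```shell
# Sketch: capture COUNT thread dumps, INTERVAL seconds apart.
# Assumes the JDK's `jstack` is on the PATH; the first argument is the
# NiFi JVM's pid (e.g. from `cat nifi.pid` or `jps`).
capture_dumps() {
  pid="$1"
  count="${2:-3}"
  interval="${3:-300}"   # 5 minutes between dumps by default
  i=1
  while [ "$i" -le "$count" ]; do
    ts=$(date +%Y%m%d-%H%M%S)
    # Index the file name as well as timestamping it, so rapid runs
    # never overwrite an earlier dump.
    jstack "$pid" > "threaddump-${ts}-${i}.txt"
    echo "wrote threaddump-${ts}-${i}.txt"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `capture_dumps <nifi-pid> 3 300` would write three dumps five minutes apart.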
I don't see a deadlock, but I do suspect either just crazy slow IO going on or a possible livelock. The thread dump will help narrow that down a bit. Can you also run 'iostat -xmh 20' for a bit (or its equivalent) on the system, please?

Thanks
Joe

On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <[email protected]> wrote:
> Mike,
>
> No need for more info. Heap/GC looks beautiful.
>
> The thread dump, however, shows some problems. The provenance
> repository is locked up. Numerous threads are sitting here:
>
>   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>
> This means these are processors committing their sessions and updating
> provenance, but they're waiting on a read lock to provenance. That lock
> cannot be obtained because a provenance maintenance thread is
> attempting to purge old events and cannot.
>
> I recall us having addressed this, so I am looking to see when that
> was fixed. If provenance is not critical for you right now, you can
> swap out the persistent implementation for the volatile provenance
> repository. In nifi.properties, change this line:
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>
> to
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>
> The behavior reminds me of this issue, which was fixed in 1.x:
> https://issues.apache.org/jira/browse/NIFI-2395
>
> Need to dig into this more...
>
> Thanks
> Joe
>
> On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <[email protected]> wrote:
>> Hi Joe,
>>
>> Thank you for your quick response. The system is currently in the
>> deadlocked state with 10 worker threads spinning, so I'll gather the
>> info you requested.
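As a quick check against any dump already collected, one can count how many threads are parked in the `persistRecord` frame Joe highlights above. This is a sketch: the `count_provenance_waiters` name is made up, and the dump file name is a placeholder for whichever dump you saved.

```shell
# Count threads blocked in PersistentProvenanceRepository.persistRecord
# in a captured thread dump. A count near the worker-thread limit is
# consistent with the provenance read-lock pile-up described above.
count_provenance_waiters() {
  grep -c 'PersistentProvenanceRepository.persistRecord' "$1"
}
```

For example, `count_provenance_waiters threaddump-20170216.txt` prints the number of matching stack frames.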
>>
>> - The available space on the partition is 223G free of 500G (same as
>>   was available for 0.6.1)
>> - java.arg.3=-Xmx4096m in bootstrap.conf
>> - The thread dump and jstat output are here:
>>   https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>>
>> Unfortunately, it's hard to predict when the decay starts, and it
>> takes too long to monitor the system manually. However, if you still
>> need thread dumps taken while it decays after seeing the attached
>> dumps, I can set up a timer script.
>>
>> Let me know if you need any more info.
>>
>> Thanks,
>> Mike.
>>
>> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <[email protected]> wrote:
>>>
>>> Mike,
>>>
>>> Can you capture a series of thread dumps as the gradual decay occurs,
>>> noting when each was generated and specifically calling out the "now
>>> the system is doing nothing" point? Can you check the space available
>>> on the system during these times as well? Also, please advise on the
>>> behavior of the heap/garbage collection. Often (though not always) a
>>> gradual decay in performance can suggest an issue with GC, as you
>>> know. Can you run something like
>>>
>>> jstat -gcutil -h5 <pid> 1000
>>>
>>> and capture those results in chunks as well?
>>>
>>> This would give us a pretty good picture of the health of the system
>>> and JVM around those times. The info is probably too much for the
>>> mailing list, so feel free to create a JIRA for this and put the
>>> attachments there, or link to gists on GitHub, etc.
>>>
>>> Pretty confident we can get to the bottom of what you're seeing quickly.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <[email protected]> wrote:
>>> > Hello,
>>> >
>>> > Recently we upgraded from 0.6.1 to 1.1.1, and at first everything
>>> > was working well. However, a few hours later none of the processors
>>> > were showing any activity.
>>> > Then I tried restarting NiFi, which caused some flowfiles to get
>>> > corrupted, as evidenced by exceptions thrown in nifi-app.log;
>>> > however, the processors still produced no activity. Next, I stopped
>>> > the service and deleted all state (content_repository,
>>> > database_repository, flowfile_repository, provenance_repository,
>>> > work). Then the processors started working again for a few hours
>>> > (maybe a day) until the deadlock occurred again.
>>> >
>>> > So the cycle continues: I have to periodically reset the service
>>> > and delete the state to get things moving. Obviously, that's not
>>> > great. I'll note that the flow.xml file has been changed by the new
>>> > version of NiFi as I added/removed processors, but 95% of the flow
>>> > configuration is the same as before the upgrade. So I'm wondering
>>> > if there is a configuration setting that causes these deadlocks.
>>> >
>>> > What I've been able to observe is that the deadlock is "gradual":
>>> > my flow usually takes about 4-5 threads to execute. The deadlock
>>> > causes the worker threads to max out at the limit, and I'm not even
>>> > able to stop any processors or list queues. I also have not seen
>>> > this behavior in a fresh install of NiFi where the flow.xml started
>>> > out empty.
>>> >
>>> > Can you give me some advice on what to do about this? Would the
>>> > problem be resolved if I manually rebuilt the flow with the new
>>> > version of NiFi (not looking forward to that)?
>>> >
>>> > Much appreciated.
>>> >
>>> > Mike.
>>> >
>>> > This email may contain material that is confidential for the sole
>>> > use of the intended recipient(s). Any review, reliance or
>>> > distribution or disclosure by others without express permission is
>>> > strictly prohibited. If you are not the intended recipient, please
>>> > contact the sender and delete all copies of this message.
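For completeness, the `jstat` capture Joe suggested earlier in the thread can be wrapped so each sample carries a timestamp column (`-t`) and can be lined up with the thread dumps afterward. This is a sketch: the `capture_gc` name is made up, and it assumes the JDK's `jstat` is on the PATH.

```shell
# Sketch: sample GC utilization once per second, repeating the header
# every 5 rows (-h5) and prefixing each sample with a timestamp (-t)
# so it can be correlated with the thread dumps.
capture_gc() {
  pid="$1"
  samples="${2:-60}"   # one sample per second for a minute by default
  jstat -gcutil -t -h5 "$pid" 1000 "$samples"
}
```

For example, `capture_gc <nifi-pid> 60 > jstat-gcutil.log` collects a minute of samples into a log file suitable for attaching to a JIRA.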
