Russ,

The ticket you reference [1] is still open, and I did not see any changes in
0.7.2 or 1.1.2 that would indicate any fix was included. You can create a PR
with your code in it (or ask someone to do it if you're not comfortable with
GitHub).
[1] https://issues.apache.org/jira/browse/NIFI-3364

Andy LoPresto
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69

> On Feb 17, 2017, at 8:31 AM, Russell Bateman <[email protected]> wrote:
>
> Mikhail,
>
> I have a short article with step-by-step information and comments on how I
> profile NiFi. You'll want the latest NiFi release, however, because the Java
> Flight Recorder JVM arguments are very order-dependent. (I'm assuming that
> NiFi 1.1.2 and 0.7.2 have the fix for the conf/bootstrap.conf numeric-argument
> order.) I've been using this for a couple of months and finally got around to
> writing it up from my personal notes in a more usable form:
>
> http://www.javahotchocolate.com/notes/jfr.html
>
> I hope this is helpful.
>
> Russ
>
> On 02/16/2017 10:18 PM, Mikhail Sosonkin wrote:
>> It's been a while since I've used a profiler, but I'll give it a shot when
>> I get to a place with a faster internet link :)
>>
>> On Fri, Feb 17, 2017 at 12:08 AM, Tony Kurc <[email protected]> wrote:
>> Mike, also regarding what Joe asked about backpressure "not being applied":
>> if you're good with a profiler, I think Joe and I both gravitated to
>> 0x00000006c533b770 being locked at
>> org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757).
>> It would be interesting to see whether that section is taking longer over
>> time.
>>
>> On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <[email protected]> wrote:
>> Mike,
>>
>> One more thing: can you please grab a couple more thread dumps for us,
>> with 5 to 10 minutes between them?
>>
>> I don't see a deadlock, but I do suspect either just crazy slow IO or a
>> possible livelock. The thread dumps will help narrow that down a bit.
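[Editor's note] The spaced thread dumps Joe asks for can be captured with a small loop. A sketch, assuming the JDK's `jstack` is on the PATH; the pid-file path in the commented invocation is an assumption about the install layout:

```shell
#!/bin/sh
# Capture several thread dumps with a fixed pause between them, one file each.
# Usage: capture_thread_dumps <pid> <count> <interval-seconds>
capture_thread_dumps() {
    pid=$1; count=$2; interval=$3
    i=1
    while [ "$i" -le "$count" ]; do
        out="threaddump-${pid}-$(date +%Y%m%dT%H%M%S).txt"
        echo "writing $out"
        jstack "$pid" > "$out" 2>/dev/null || echo "jstack failed for pid $pid"
        i=$((i + 1))
        if [ "$i" -le "$count" ]; then sleep "$interval"; fi
    done
}

# Joe asked for 5-10 minutes between dumps; the pid-file path is assumed:
# capture_thread_dumps "$(cat /opt/nifi/run/nifi.pid)" 3 300
```

Timestamped filenames make it easy to line each dump up with the point in the decay at which it was taken.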
>>
>> Can you also run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> system, please?
>>
>> Thanks
>> Joe
>>
>> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <[email protected]> wrote:
>> > Mike,
>> >
>> > No need for more info. Heap/GC looks beautiful.
>> >
>> > The thread dump, however, shows some problems. The provenance
>> > repository is locked up. Numerous threads are sitting here:
>> >
>> >   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> >   at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>> >
>> > This means these are processors committing their sessions and updating
>> > provenance, but they're waiting on a read lock to provenance. This lock
>> > cannot be obtained because a provenance maintenance thread is
>> > attempting to purge old events and cannot.
>> >
>> > I recall us having addressed this, so I am looking to see when that
>> > was done. If provenance is not critical for you right now, you can
>> > swap out the persistent implementation for the volatile provenance
>> > repository. In nifi.properties change this line
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>> >
>> > to
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>> >
>> > The behavior reminds me of this issue, which was fixed in 1.x:
>> > https://issues.apache.org/jira/browse/NIFI-2395
>> >
>> > Need to dig into this more...
>> >
>> > Thanks
>> > Joe
>> >
>> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <[email protected]> wrote:
>> >> Hi Joe,
>> >>
>> >> Thank you for your quick response. The system is currently in the
>> >> deadlocked state with 10 worker threads spinning.
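[Editor's note] The repository swap Joe suggests above is a one-line edit to nifi.properties, followed by a NiFi restart. A minimal sketch; the install path in the commented invocation is an assumption, and a .bak copy is kept so the change is easy to revert:

```shell
# Point nifi.properties at the volatile provenance repository, keeping a .bak copy.
# Usage: swap_provenance_repo <path-to-nifi.properties>
swap_provenance_repo() {
    props=$1
    sed -i.bak \
        -e 's|^nifi\.provenance\.repository\.implementation=.*|nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository|' \
        "$props"
    # Echo the resulting line so the change can be eyeballed.
    grep '^nifi.provenance.repository.implementation=' "$props"
}

# swap_provenance_repo /opt/nifi/conf/nifi.properties   # path is an assumption
```

Note the trade-off Joe implies: the volatile repository keeps provenance in memory only, so events are lost on restart.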
>> >> So, I'll gather the info you requested:
>> >>
>> >> - The available space on the partition is 223G free of 500G (the same
>> >>   as was available for 0.6.1)
>> >> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >> - The thread dump and jstat output are here:
>> >>   https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >>
>> >> Unfortunately, it's hard to predict when the decay starts, and it takes
>> >> too long to monitor the system manually. However, if after seeing the
>> >> attached dumps you still need thread dumps taken while it decays, I can
>> >> set up a timer script.
>> >>
>> >> Let me know if you need any more info.
>> >>
>> >> Thanks,
>> >> Mike.
>> >>
>> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <[email protected]> wrote:
>> >>> Mike,
>> >>>
>> >>> Can you capture a series of thread dumps as the gradual decay occurs,
>> >>> and note at what point each was generated, specifically calling out
>> >>> the "now the system is doing nothing" point? Can you check the space
>> >>> available on the system during these times as well? Also, please
>> >>> advise on the behavior of the heap/garbage collection. Often (not
>> >>> always) a gradual decay in performance can suggest an issue with GC,
>> >>> as you know. Can you run something like
>> >>>
>> >>> jstat -gcutil -h5 <pid> 1000
>> >>>
>> >>> and capture those results in these chunks as well?
>> >>>
>> >>> This would give us a pretty good picture of the health of the system
>> >>> and JVM around these times. It is probably too much info for the
>> >>> mailing list, so feel free to create a JIRA for this and put the
>> >>> attachments there, or link to gists on GitHub, etc.
>> >>>
>> >>> I'm pretty confident we can get to the bottom of what you're seeing
>> >>> quickly.
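[Editor's note] Joe's `jstat -gcutil -h5 <pid> 1000` prints GC utilization percentages every second, repeating the column header every five rows. A small wrapper that logs a bounded number of samples to a timestamped file might look like this; the JDK's `jstat` on the PATH and the pid-file path in the commented invocation are assumptions:

```shell
# Log a bounded run of jstat GC-utilization samples to a timestamped file.
# Usage: capture_gc_stats <pid> <sample-count>
capture_gc_stats() {
    pid=$1; samples=$2
    log="gcutil-${pid}-$(date +%Y%m%dT%H%M%S).log"
    echo "sampling pid $pid ($samples samples, 1s apart) into $log"
    # -gcutil: utilization percentages; -h5: header every 5 rows; 1000 ms interval
    jstat -gcutil -h5 "$pid" 1000 "$samples" > "$log" 2>/dev/null \
        || echo "jstat failed for pid $pid"
}

# capture_gc_stats "$(cat /opt/nifi/run/nifi.pid)" 60   # pid-file path is an assumption
```

Pairing each log with a thread dump taken at the same time makes it easier to tell a GC problem from the lock contention discussed above.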
>> >>>
>> >>> Thanks
>> >>> Joe
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <[email protected]> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > Recently, we upgraded from 0.6.1 to 1.1.1, and at first everything
>> >>> > was working well. However, a few hours later none of the processors
>> >>> > were showing any activity. I then tried restarting NiFi, which caused
>> >>> > some flowfiles to get corrupted, as evidenced by exceptions thrown in
>> >>> > nifi-app.log; the processors still showed no activity. Next, I
>> >>> > stopped the service and deleted all state (content_repository,
>> >>> > database_repository, flowfile_repository, provenance_repository,
>> >>> > work). Then the processors start working for a few hours (maybe a
>> >>> > day) until the deadlock occurs again.
>> >>> >
>> >>> > So, this cycle continues, where I have to periodically reset the
>> >>> > service and delete the state to get things moving. Obviously, that's
>> >>> > not great. I'll note that the flow.xml file has been changed by the
>> >>> > new version of NiFi, as I added/removed processors, but 95% of the
>> >>> > flow configuration is the same as before the upgrade. So, I'm
>> >>> > wondering if there is a configuration setting that causes these
>> >>> > deadlocks.
>> >>> >
>> >>> > What I've been able to observe is that the deadlock is "gradual", in
>> >>> > that my flow usually takes about 4-5 threads to execute. The deadlock
>> >>> > causes the worker threads to max out at the limit, and I'm not even
>> >>> > able to stop any processors or list queues. I also have not seen this
>> >>> > behavior in a fresh install of NiFi where the flow.xml starts out
>> >>> > empty.
>> >>> >
>> >>> > Can you give me some advice on what to do about this?
>> >>> > Would the problem be resolved if I manually rebuilt the flow with
>> >>> > the new version of NiFi (not looking forward to that)?
>> >>> >
>> >>> > Much appreciated.
>> >>> >
>> >>> > Mike.
>> >>> >
>> >>> > This email may contain material that is confidential for the sole
>> >>> > use of the intended recipient(s). Any review, reliance or
>> >>> > distribution or disclosure by others without express permission is
>> >>> > strictly prohibited. If you are not the intended recipient, please
>> >>> > contact the sender and delete all copies of this message.
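[Editor's note] For reference, the reset cycle Mike describes (stop NiFi, delete the repository state directories, restart) could be scripted as below. This discards all queued flowfiles and provenance history, so it is a last resort; the directory names follow Mike's list, and the NiFi home layout (`bin/nifi.sh`) is an assumption about a conventional install:

```shell
# Stop NiFi, delete the repository state directories Mike lists, and restart.
# WARNING: this destroys all queued flowfiles and provenance history.
# Usage: reset_nifi_state <nifi-home>
reset_nifi_state() {
    nifi_home=${1:?usage: reset_nifi_state <nifi-home>}
    "$nifi_home/bin/nifi.sh" stop
    for d in content_repository database_repository flowfile_repository \
             provenance_repository work; do
        rm -rf "${nifi_home:?}/${d:?}"
        echo "removed $d"
    done
    "$nifi_home/bin/nifi.sh" start
}
```

The `:?` parameter guards ensure `rm -rf` never runs against an empty path if the function is called incorrectly.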
