Mikhail,

I have a short article with step-by-step instructions and comments on how I profile NiFi. You'll want the latest NiFi release, however, because the Java Flight Recorder JVM arguments are very order-dependent. (I'm assuming that NiFi 1.1.2 and 0.7.2 have the fix for the numeric-argument ordering in conf/bootstrap.conf.) I've been using this for a couple of months and finally got around to writing it up from my personal notes in a more usable form:

http://www.javahotchocolate.com/notes/jfr.html
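For orientation (the article above has the authoritative, order-sensitive list), the additions to conf/bootstrap.conf look roughly like the following sketch. The flag names assume Oracle JDK 8, where Flight Recorder was still a commercial feature, and the java.arg.N numbers are placeholders — they must not collide with the numbered arguments already in the file:

```
# Hypothetical bootstrap.conf fragment -- the arg numbers and the
# recording options are illustrative, not taken from the article:
java.arg.20=-XX:+UnlockCommercialFeatures
java.arg.21=-XX:+FlightRecorder
java.arg.22=-XX:StartFlightRecording=duration=10m,filename=nifi.jfr
```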

I hope this is helpful.

Russ

On 02/16/2017 10:18 PM, Mikhail Sosonkin wrote:
Been a while since I've used a profiler, but I'll give it a shot when I get to a place with a faster internet link :)

On Fri, Feb 17, 2017 at 12:08 AM, Tony Kurc <[email protected]> wrote:

    Mike, also if what Joe asked with the backpressure is "not being
    applied", if you're good with a profiler, I think Joe and I both
    gravitated to 0x00000006c533b770 being locked in at
    org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757).
    It would be interesting to see if that section is taking longer
    over time.

    On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <[email protected]> wrote:

        Mike

        One more thing...can you please grab a couple more thread dumps
        for us with 5 to 10 mins between?

        I don't see a deadlock but do suspect either just crazy slow IO
        going on or a possible livelock.  The thread dump will help
        narrow that down a bit.

        Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
        system too please.

        Thanks
        Joe
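[Editor's note: a timer script like the one Mike mentions later in the thread could look roughly like this. A sketch only: jstack is the standard JDK dump tool, but the function name, filenames, and defaults are invented for illustration; run the `iostat -xmh 20` Joe asks for in a second terminal alongside it.]

```shell
# capture_thread_dumps <pid> <count> <interval-seconds> [outdir]
# Writes one timestamped jstack dump per interval; e.g. 4 dumps
# 5 minutes apart:  capture_thread_dumps <nifi-pid> 4 300 /tmp/dumps
capture_thread_dumps() {
    pid=$1; count=$2; interval=$3; outdir=${4:-.}
    i=0
    while [ "$i" -lt "$count" ]; do
        stamp=$(date +%Y%m%d-%H%M%S)
        # Capture stderr too, so a failed dump is visible in the file
        jstack "$pid" > "$outdir/threaddump-$stamp-$i.txt" 2>&1 || true
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```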

        On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <[email protected]> wrote:
        > Mike,
        >
        > No need for more info.  Heap/GC looks beautiful.
        >
        > The thread dump however, shows some problems.  The provenance
        > repository is locked up.  Numerous threads are sitting here
        >
        > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        > at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
        >
        > This means these are processors committing their sessions and
        > updating provenance but they're waiting on a readlock to
        > provenance. This lock cannot be obtained because a provenance
        > maintenance thread is attempting to purge old events and cannot.
        >
        > I recall us having addressed this so am looking to see when
        > that was addressed.  If provenance is not critical for you
        > right now you can swap out the persistent implementation with
        > the volatile provenance repository.  In nifi.properties change
        > this line
        >
        > nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
        >
        > to
        >
        > nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
        >
        > The behavior reminds me of this issue which was fixed in 1.x
        > https://issues.apache.org/jira/browse/NIFI-2395
        >
        > Need to dig into this more...
        >
        > Thanks
        > Joe
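[Editor's note: for anyone triaging a dump like this, the thread that actually holds the lock everyone is parked on can be found by searching the dump for the monitor address. A rough sketch against a toy excerpt — the thread names and frames below are invented to mimic jstack output, not taken from Mike's gist:]

```shell
# Toy stand-in for a real jstack dump; replace with the captured file.
cat > threaddump.txt <<'EOF'
"Timer-Driven Process Thread-3" #41 prio=5
   java.lang.Thread.State: WAITING (parking)
        - parking to wait for  <0x00000006c533b770> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
"Provenance Maintenance Thread-1" #52 prio=5
   java.lang.Thread.State: WAITING (parking)
        - parking to wait for  <0x00000006c533b770> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
EOF

# Every thread touching the address, with enough context to read its
# name; the holder shows "locked" on the address, waiters show
# "parking to wait for".
grep -B 2 '0x00000006c533b770' threaddump.txt
```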
        >
        > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <[email protected]> wrote:
        >> Hi Joe,
        >>
        >> Thank you for your quick response. The system is currently in
        >> the deadlock state with 10 worker threads spinning. So, I'll
        >> gather the info you requested.
        >>
        >> - The available space on the partition is 223G free of 500G
        >>   (same as was available for 0.6.1)
        >> - java.arg.3=-Xmx4096m in bootstrap.conf
        >> - thread dump and jstats are here:
        >>   https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
        >>
        >> Unfortunately, it's hard to predict when the decay starts and
        >> it takes too long to have to monitor the system manually.
        >> However, if after seeing the attached dumps you still need the
        >> thread dumps while it decays, I can set up a timer script.
        >>
        >> Let me know if you need any more info.
        >>
        >> Thanks,
        >> Mike.
        >>
        >>
        >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <[email protected]> wrote:
        >>>
        >>> Mike,
        >>>
        >>> Can you capture a series of thread dumps as the gradual decay
        >>> occurs and signal at what point they were generated,
        >>> specifically calling out the "now the system is doing
        >>> nothing" point.  Can you check for space available on the
        >>> system during these times as well.  Also, please advise on
        >>> the behavior of the heap/garbage collection.  Often (not
        >>> always) a gradual decay in performance can suggest an issue
        >>> with GC as you know.  Can you run something like
        >>>
        >>> jstat -gcutil -h5 <pid> 1000
        >>>
        >>> And capture those results in these chunks as well.
        >>>
        >>> This would give us a pretty good picture of the health of the
        >>> system and JVM around these times.  It is probably too much
        >>> info for the mailing list so feel free to create a JIRA for
        >>> this and put attachments there or link to gists in github/etc.
        >>>
        >>> Pretty confident we can get to the bottom of what you're
        >>> seeing quickly.
        >>>
        >>> Thanks
        >>> Joe
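[Editor's note: to keep those jstat samples alongside the thread dumps, Joe's command can be wrapped so each run is also logged to disk. A sketch; jstat and tee are the standard tools, but the function name and log filename are illustrative:]

```shell
# jstat_log <pid> <outfile> -- the same sampling Joe suggests
# (1000 ms interval, header repeated every 5 rows), copied to a
# file via tee so it can be attached to a JIRA later.
jstat_log() {
    jstat -gcutil -h5 "$1" 1000 | tee "$2"
}
# Example against an assumed NiFi pid; stop with Ctrl-C:
#   jstat_log 12345 "gcutil-$(date +%Y%m%d-%H%M%S).log"
```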
        >>>
        >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <[email protected]> wrote:
        >>> > Hello,
        >>> >
        >>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first
        >>> > everything was working well. However, a few hours later
        >>> > none of the processors were showing any activity. Then, I
        >>> > tried restarting NiFi, which caused some flowfiles to get
        >>> > corrupted, as evidenced by exceptions thrown in
        >>> > nifi-app.log; however, the processors still continued to
        >>> > produce no activity. Next, I stopped the service and
        >>> > deleted all state (content_repository, database_repository,
        >>> > flowfile_repository, provenance_repository, work). Then the
        >>> > processors start working for a few hours (maybe a day)
        >>> > until the deadlock occurs again.
        >>> >
        >>> > So, this cycle continues where I have to periodically reset
        >>> > the service and delete the state to get things moving.
        >>> > Obviously, that's not great. I'll note that the flow.xml
        >>> > file has been changed by the new version of NiFi as I
        >>> > added/removed processors, but 95% of the flow configuration
        >>> > is the same as before the upgrade. So, I'm wondering if
        >>> > there is a configuration setting that causes these
        >>> > deadlocks.
        >>> >
        >>> > What I've been able to observe is that the deadlock is
        >>> > "gradual" in that my flow usually takes about 4-5 threads
        >>> > to execute. The deadlock causes the worker threads to max
        >>> > out at the limit and I'm not even able to stop any
        >>> > processors or list queues. I also have not seen this
        >>> > behavior in a fresh install of NiFi where the flow.xml
        >>> > would start out empty.
        >>> >
        >>> > Can you give me some advice on what to do about this?
        >>> > Would the problem be resolved if I manually rebuilt the
        >>> > flow with the new version of NiFi (not looking forward to
        >>> > that)?
        >>> >
        >>> > Much appreciated.
        >>> >
        >>> > Mike.
        >>> >
        >>> > This email may contain material that is confidential for
        >>> > the sole use of the intended recipient(s).  Any review,
        >>> > reliance or distribution or disclosure by others without
        >>> > express permission is strictly prohibited.  If you are not
        >>> > the intended recipient, please contact the sender and
        >>> > delete all copies of this message.



