Cool thanks Mike.  Mount question/concern resolved.

On Fri, Feb 17, 2017 at 12:33 AM, Mikhail Sosonkin <[email protected]> wrote:
> I'm really happy that you guys responded so well, it's quite lonely googling
> for this stuff :)
>
> Right now the volume is high because nifi is catching up at about 2.5G/5m
> and 500 FlowFiles/5m, but normally we're at about 100Mb/5m with a few
> spikes here and there, nothing too intense.
>
> We are using an EC2 instance with 32G RAM and a 500G SSD. All the work is
> done on the same mount. Not sure what you mean by timestamps in this case.
> Our setup is pretty close to out of the box, with only a heap size limit
> change in bootstrap and a few Groovy-based processors.
>
> I'll try to get you some thread dumps for the decay, might have to wait
> until tomorrow or Monday though. I want to see if I can get it to behave
> like this on a test system.
>
> Mike.
>
> On Fri, Feb 17, 2017 at 12:13 AM, Joe Witt <[email protected]> wrote:
>>
>> Mike
>>
>> Totally get it.  If you are able to get back into that state on this or
>> another system, we're highly interested to learn more.  In looking at
>> the code relevant to your stack trace I'm not quite seeing the trail
>> just yet.  The problem is definitely with the persistent prov.
>> Getting the phased thread dumps will help tell more of the story.
>>
>> Also, can you tell us anything about the volume/mount that the nifi
>> install and specific provenance is on?  Any interesting mount options
>> involving timestamps, etc..?
>>
>> No rush of course and glad you're back in business.  But, you've
>> definitely got our attention :-)
>>
>> Thanks
>> Joe
>>
>> On Fri, Feb 17, 2017 at 12:10 AM, Mikhail Sosonkin <[email protected]>
>> wrote:
>> > Joe,
>> >
>> > Many thanks for the pointer on the Volatile provenance. It is, indeed,
>> > more critical for us that the data moves. Before receiving this message,
>> > I changed the config and restarted. The data started moving, which is
>> > awesome!
>> >
>> > I'm happy to help you debug this issue. Do you need these collections
>> > with the volatile setting, or the persistent setting in the locked state?
>> >
>> > Mike.
>> >
>> > On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <[email protected]> wrote:
>> >>
>> >> Mike
>> >>
>> >> One more thing...can you please grab a couple more thread dumps for us
>> >> with 5 to 10 mins between?
>> >>
>> >> I don't see a deadlock but do suspect either just crazy slow IO going
>> >> on or a possible livelock.  The thread dump will help narrow that down
>> >> a bit.
>> >>
>> >> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> >> system too please.
>> >>
>> >> Thanks
>> >> Joe
>> >>
>> >> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <[email protected]> wrote:
>> >> > Mike,
>> >> >
>> >> > No need for more info.  Heap/GC looks beautiful.
>> >> >
>> >> > The thread dump however, shows some problems.  The provenance
>> >> > repository is locked up.  Numerous threads are sitting here
>> >> >
>> >> >   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> >> >   at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>> >> >
>> >> > This means these are processors committing their sessions and updating
>> >> > provenance, but they're waiting on a read lock to provenance. This
>> >> > lock cannot be obtained because a provenance maintenance thread is
>> >> > attempting to purge old events and cannot.
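The pile-up Joe describes can be sketched outside NiFi with a fair ReentrantReadWriteLock (a minimal, hypothetical reproduction -- the class and thread roles here are invented for illustration, not NiFi's actual code): once a writer is queued behind an active reader, every later read-lock request parks behind the writer too.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ProvLockSketch {
    // Returns whether a new reader could get the read lock while a
    // writer (the "purge" role) is already queued behind an old reader.
    static boolean contendingReaderAcquired() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair
        CountDownLatch readerIn = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // A long-lived reader holding the read lock (e.g. an event lookup).
        Thread reader = new Thread(() -> {
            lock.readLock().lock();
            readerIn.countDown();
            try {
                release.await();
            } catch (InterruptedException ignored) {
            } finally {
                lock.readLock().unlock();
            }
        });
        reader.start();
        readerIn.await();

        // The maintenance thread purging old events wants the write lock;
        // it parks until all readers are gone.
        Thread purger = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });
        purger.start();
        while (lock.getQueueLength() < 1) {
            Thread.sleep(5); // wait until the writer is actually queued
        }

        // A processor committing its session now asks for the read lock.
        // In fair mode it queues behind the waiting writer, so it stalls
        // too -- the timed tryLock stands in for the blocked
        // readLock().lock() seen in the thread dump.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);

        release.countDown(); // old reader exits; purger runs; all unblock
        purger.join();
        reader.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("new reader acquired lock while purge waits: "
                + contendingReaderAcquired());
    }
}
```

The point is only the shape of the stall: session commits (readers) back up behind a maintenance writer that itself cannot proceed, so thread counts climb while throughput drops to zero.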
>> >> >
>> >> > I recall us having addressed this, so I'm looking to see when that
>> >> > was.  If provenance is not critical for you right now, you can swap
>> >> > out the persistent implementation with the volatile provenance
>> >> > repository.  In nifi.properties, change this line
>> >> >
>> >> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>> >> >
>> >> > to
>> >> >
>> >> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
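The swap above can also be scripted. This sketch edits a scratch copy unless you point NIFI_PROPS at your real conf/nifi.properties (the path varies by install, so treat it as an assumption), and keeps a .bak of the original:

```shell
# Point NIFI_PROPS at the real conf/nifi.properties to apply for real;
# by default this demonstrates against a scratch copy.
NIFI_PROPS="${NIFI_PROPS:-$(mktemp)}"
[ -s "$NIFI_PROPS" ] || echo 'nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository' > "$NIFI_PROPS"

# Rewrite the implementation line in place, keeping a .bak backup.
sed -i.bak \
  's/^nifi\.provenance\.repository\.implementation=.*/nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository/' \
  "$NIFI_PROPS"

grep '^nifi.provenance.repository.implementation=' "$NIFI_PROPS"
```

Keep in mind the volatile repository holds events in memory only, so provenance history is lost on restart.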
>> >> >
>> >> > The behavior reminds me of this issue which was fixed in 1.x
>> >> > https://issues.apache.org/jira/browse/NIFI-2395
>> >> >
>> >> > Need to dig into this more...
>> >> >
>> >> > Thanks
>> >> > Joe
>> >> >
>> >> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin
>> >> > <[email protected]>
>> >> > wrote:
>> >> >> Hi Joe,
>> >> >>
>> >> >> Thank you for your quick response. The system is currently in the
>> >> >> deadlocked state with 10 worker threads spinning, so I'll gather the
>> >> >> info you requested.
>> >> >>
>> >> >> - The available space on the partition is 223G free of 500G (same as
>> >> >> was available for 0.6.1)
>> >> >> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >> >> - thread dump and jstats are here:
>> >> >> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >> >>
>> >> >> Unfortunately, it's hard to predict when the decay starts, and it
>> >> >> takes too long to monitor the system manually. However, if after
>> >> >> seeing the attached dumps you still need thread dumps from while it
>> >> >> decays, I can set up a timer script.
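For the timer script, something like this sketch could work (`jstack` from the same JDK running NiFi is assumed on PATH, and the pgrep pattern in the example call is a guess):

```shell
# dump_threads PID OUTDIR COUNT INTERVAL_SECONDS
# Writes COUNT jstack thread dumps, INTERVAL_SECONDS apart.
dump_threads() {
    pid=$1; outdir=$2; count=$3; interval=$4
    mkdir -p "$outdir"
    n=0
    while [ "$n" -lt "$count" ]; do
        # one timestamped dump per file; jstack errors are captured too
        jstack "$pid" > "$outdir/threads-$(date +%Y%m%d-%H%M%S)-$n.txt" 2>&1 || true
        n=$((n + 1))
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# e.g. an hour of dumps, 5 minutes apart:
# dump_threads "$(pgrep -f org.apache.nifi | head -n 1)" /tmp/nifi-dumps 12 300
```

Running it from cron or nohup means the decay gets captured even when nobody is watching the system.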
>> >> >>
>> >> >> Let me know if you need any more info.
>> >> >>
>> >> >> Thanks,
>> >> >> Mike.
>> >> >>
>> >> >>
>> >> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <[email protected]>
>> >> >> wrote:
>> >> >>>
>> >> >>> Mike,
>> >> >>>
>> >> >>> Can you capture a series of thread dumps as the gradual decay
>> >> >>> occurs and signal at what point each was generated, specifically
>> >> >>> calling out the "now the system is doing nothing" point.  Can you
>> >> >>> check for space available on the system during these times as well.
>> >> >>> Also, please advise on the behavior of the heap/garbage collection.
>> >> >>> Often (not always) a gradual decay in performance can suggest an
>> >> >>> issue with GC, as you know.  Can you run something like
>> >> >>>
>> >> >>> jstat -gcutil -h5 <pid> 1000
>> >> >>>
>> >> >>> and capture those results in these chunks as well.
>> >> >>>
>> >> >>> This would give us a pretty good picture of the health of the
>> >> >>> system and JVM around these times.  It is probably too much info
>> >> >>> for the mailing list, so feel free to create a JIRA for this and
>> >> >>> put attachments there, or link to gists on GitHub/etc.
>> >> >>>
>> >> >>> Pretty confident we can get to the bottom of what you're seeing
>> >> >>> quickly.
>> >> >>>
>> >> >>> Thanks
>> >> >>> Joe
>> >> >>>
>> >> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin
>> >> >>> <[email protected]>
>> >> >>> wrote:
>> >> >>> > Hello,
>> >> >>> >
>> >> >>> > Recently, we've upgraded from 0.6.1 to 1.1.1, and at first
>> >> >>> > everything was working well. However, a few hours later none of
>> >> >>> > the processors were showing any activity. Then, I tried
>> >> >>> > restarting nifi, which caused some flowfiles to get corrupted, as
>> >> >>> > evidenced by exceptions thrown in nifi-app.log; however, the
>> >> >>> > processors still continued to produce no activity. Next, I
>> >> >>> > stopped the service and deleted all state (content_repository,
>> >> >>> > database_repository, flowfile_repository, provenance_repository,
>> >> >>> > work). Then the processors start working for a few hours (maybe
>> >> >>> > a day) until the deadlock occurs again.
>> >> >>> >
>> >> >>> > So, this cycle continues: I have to periodically reset the
>> >> >>> > service and delete the state to get things moving. Obviously,
>> >> >>> > that's not great. I'll note that the flow.xml file has been
>> >> >>> > changed by the new version of nifi as I added/removed processors,
>> >> >>> > but 95% of the flow configuration is the same as before the
>> >> >>> > upgrade. So, I'm wondering if there is a configuration setting
>> >> >>> > that causes these deadlocks.
>> >> >>> >
>> >> >>> > What I've been able to observe is that the deadlock is
>> >> >>> > "gradual", in that my flow usually takes about 4-5 threads to
>> >> >>> > execute. The deadlock causes the worker threads to max out at the
>> >> >>> > limit, and I'm not even able to stop any processors or list
>> >> >>> > queues. I also have not seen this behavior in a fresh install of
>> >> >>> > Nifi where the flow.xml would start out empty.
>> >> >>> >
>> >> >>> > Can you give me some advice on what to do about this? Would the
>> >> >>> > problem be resolved if I manually rebuilt the flow with the new
>> >> >>> > version of Nifi (not looking forward to that)?
>> >> >>> >
>> >> >>> > Much appreciated.
>> >> >>> >
>> >> >>> > Mike.
>> >> >>> >
>> >> >>> > This email may contain material that is confidential for the
>> >> >>> > sole use of the intended recipient(s).  Any review, reliance or
>> >> >>> > distribution or disclosure by others without express permission
>> >> >>> > is strictly prohibited.  If you are not the intended recipient,
>> >> >>> > please contact the sender and delete all copies of this message.
>> >> >>
>> >
>
