Cool, thanks Mike. Mount question/concern resolved.
On Fri, Feb 17, 2017 at 12:33 AM, Mikhail Sosonkin <[email protected]> wrote:
> I'm really happy that you guys responded so well, it's quite lonely googling
> for this stuff :)
>
> Right now the volume is high because nifi is catching up at about 2.5G/5m
> and 500 FlowFiles/5m, but normally we're at about 100Mb/5m with a few spikes
> here and there, nothing too intense.
>
> We are using an EC2 instance with 32G RAM and 500G SSD. All the work is done
> on the same mount. Not sure what you mean by timestamps in this case. Our
> setup is pretty close to out of the box, with only a heap size limit change
> in bootstrap and a few Groovy-based processors.
>
> I'll try to get you some thread dumps for the decay, might have to wait
> until tomorrow or Monday though. I want to see if I can get it to behave
> like this on a test system.
>
> Mike.
>
> On Fri, Feb 17, 2017 at 12:13 AM, Joe Witt <[email protected]> wrote:
>>
>> Mike
>>
>> Totally get it. If you are able to on this or another system get back
>> into that state, we're highly interested to learn more. In looking at
>> the code relevant to your stack trace I'm not quite seeing the trail
>> just yet. The problem is definitely with the persistent prov.
>> Getting the phased thread dumps will help tell more of the story.
>>
>> Also, can you tell us anything about the volume/mount that the nifi
>> install, and specifically provenance, is on? Any interesting mount
>> options involving timestamps, etc.?
>>
>> No rush of course and glad you're back in business. But, you've
>> definitely got our attention :-)
>>
>> Thanks
>> Joe
>>
>> On Fri, Feb 17, 2017 at 12:10 AM, Mikhail Sosonkin <[email protected]> wrote:
>> > Joe,
>> >
>> > Many thanks for the pointer on the Volatile provenance. It is, indeed,
>> > more critical for us that the data moves. Before receiving this message,
>> > I changed the config and restarted. The data started moving, which is
>> > awesome!
>> >
>> > I'm happy to help you debug this issue. Do you need these collections
>> > with the volatile setting, or the persistent setting in the locked state?
>> >
>> > Mike.
>> >
>> > On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <[email protected]> wrote:
>> >>
>> >> Mike
>> >>
>> >> One more thing... can you please grab a couple more thread dumps for us
>> >> with 5 to 10 mins between?
>> >>
>> >> I don't see a deadlock, but do suspect either just crazy slow IO going
>> >> on or a possible livelock. The thread dumps will help narrow that down
>> >> a bit.
>> >>
>> >> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> >> system too please.
>> >>
>> >> Thanks
>> >> Joe
>> >>
>> >> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <[email protected]> wrote:
>> >> > Mike,
>> >> >
>> >> > No need for more info. Heap/GC looks beautiful.
>> >> >
>> >> > The thread dump, however, shows some problems. The provenance
>> >> > repository is locked up. Numerous threads are sitting here:
>> >> >
>> >> >   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> >> >   at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>> >> >
>> >> > This means these are processors committing their sessions and updating
>> >> > provenance, but they're waiting on a read lock to provenance. This lock
>> >> > cannot be obtained because a provenance maintenance thread is
>> >> > attempting to purge old events and cannot.
>> >> >
>> >> > I recall us having addressed this, so am looking to see when that was
>> >> > addressed. If provenance is not critical for you right now, you can
>> >> > swap out the persistent implementation for the volatile provenance
>> >> > repository. In nifi.properties, change this line:
>> >> >
>> >> >   nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>> >> >
>> >> > to
>> >> >
>> >> >   nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
>> >> >
>> >> > The behavior reminds me of this issue, which was fixed in 1.x:
>> >> > https://issues.apache.org/jira/browse/NIFI-2395
>> >> >
>> >> > Need to dig into this more...
>> >> >
>> >> > Thanks
>> >> > Joe
>> >> >
>> >> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <[email protected]> wrote:
>> >> >> Hi Joe,
>> >> >>
>> >> >> Thank you for your quick response. The system is currently in the
>> >> >> deadlocked state with 10 worker threads spinning, so I'll gather the
>> >> >> info you requested:
>> >> >>
>> >> >> - The available space on the partition is 223G free of 500G (same as
>> >> >>   was available for 0.6.1)
>> >> >> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >> >> - thread dump and jstat output are here:
>> >> >>   https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >> >>
>> >> >> Unfortunately, it's hard to predict when the decay starts and it takes
>> >> >> too long to monitor the system manually. However, if after seeing the
>> >> >> attached dumps you still need thread dumps while it decays, I can set
>> >> >> up a timer script.
>> >> >>
>> >> >> Let me know if you need any more info.
>> >> >>
>> >> >> Thanks,
>> >> >> Mike.
>> >> >>
>> >> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <[email protected]> wrote:
>> >> >>>
>> >> >>> Mike,
>> >> >>>
>> >> >>> Can you capture a series of thread dumps as the gradual decay occurs
>> >> >>> and signal at what point they were generated, specifically calling
>> >> >>> out the "now the system is doing nothing" point. Can you check for
>> >> >>> space available on the system during these times as well. Also,
>> >> >>> please advise on the behavior of the heap/garbage collection. Often
>> >> >>> (not always) a gradual decay in performance can suggest an issue with
>> >> >>> GC, as you know. Can you run something like
>> >> >>>
>> >> >>>   jstat -gcutil -h5 <pid> 1000
>> >> >>>
>> >> >>> and capture those results in chunks as well.
>> >> >>>
>> >> >>> This would give us a pretty good picture of the health of the system
>> >> >>> and JVM around these times. It is probably too much info for the
>> >> >>> mailing list, so feel free to create a JIRA for this and put
>> >> >>> attachments there, or link to gists in github/etc.
>> >> >>>
>> >> >>> Pretty confident we can get to the bottom of what you're seeing
>> >> >>> quickly.
>> >> >>>
>> >> >>> Thanks
>> >> >>> Joe
>> >> >>>
>> >> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <[email protected]> wrote:
>> >> >>> > Hello,
>> >> >>> >
>> >> >>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first
>> >> >>> > everything was working well. However, a few hours later none of
>> >> >>> > the processors were showing any activity.
>> >> >>> > Then I tried restarting nifi, which caused some flowfiles to get
>> >> >>> > corrupted, evidenced by exceptions thrown in nifi-app.log; however,
>> >> >>> > the processors still continued to show no activity. Next, I stopped
>> >> >>> > the service and deleted all state (content_repository,
>> >> >>> > database_repository, flowfile_repository, provenance_repository,
>> >> >>> > work). Then the processors start working for a few hours (maybe a
>> >> >>> > day) until the deadlock occurs again.
>> >> >>> >
>> >> >>> > So, this cycle continues where I have to periodically reset the
>> >> >>> > service and delete the state to get things moving. Obviously,
>> >> >>> > that's not great. I'll note that the flow.xml file has been changed
>> >> >>> > by the new version of nifi as I added/removed processors, but 95%
>> >> >>> > of the flow configuration is the same as before the upgrade. So,
>> >> >>> > I'm wondering if there is a configuration setting that causes these
>> >> >>> > deadlocks.
>> >> >>> >
>> >> >>> > What I've been able to observe is that the deadlock is "gradual" in
>> >> >>> > that my flow usually takes about 4-5 threads to execute. The
>> >> >>> > deadlock causes the worker threads to max out at the limit, and I'm
>> >> >>> > not even able to stop any processors or list queues. I also have
>> >> >>> > not seen this behavior in a fresh install of Nifi where the
>> >> >>> > flow.xml starts out empty.
>> >> >>> >
>> >> >>> > Can you give me some advice on what to do about this? Would the
>> >> >>> > problem be resolved if I manually rebuilt the flow with the new
>> >> >>> > version of Nifi (not looking forward to that)?
>> >> >>> >
>> >> >>> > Much appreciated.
>> >> >>> >
>> >> >>> > Mike.
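For readers following the diagnosis above, here is a minimal, self-contained Java sketch of the contention pattern Joe describes. It is a hypothetical illustration, not NiFi's actual PersistentProvenanceRepository code: processor threads take a read lock to persist events while a maintenance thread takes the write lock to purge old ones. Assuming a fair-mode ReentrantReadWriteLock (fairness is assumed here for clarity), once the purge thread is queued on the write lock, every newly arriving reader blocks behind it, so if the purge can never finish, session commits pile up exactly as in the thread dump.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the stall described above -- not NiFi source code.
public class ProvenanceLockSketch {

    // Fair mode: readers arriving after a queued writer wait behind it.
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true);

    // Analogous to persistRecord(): called by processor threads on session commit.
    public void persistRecord(String event) {
        rwLock.readLock().lock(); // blocks once the purge thread is queued for the write lock
        try {
            // ... append the event to the current journal ...
        } finally {
            rwLock.readLock().unlock();
        }
    }

    // Analogous to the maintenance thread purging expired events.
    public void purgeOldEvents() {
        rwLock.writeLock().lock(); // must wait for all in-flight readers to release
        try {
            // ... delete expired event files; if this step can never complete,
            // every thread stuck in persistRecord() above stays blocked ...
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}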
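The "timer script" Mike mentions for phased thread dumps can be as simple as running jstack against the NiFi PID from cron every few minutes. As a sketch of the same idea using only the standard library, the hypothetical in-process variant below dumps its own JVM's threads at the 5-minute spacing Joe asked for; the lock-owner details it prints are what expose a read/write lock pile-up like the one above.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.time.Instant;

// Hypothetical sketch: periodically dump all threads of the local JVM.
// In practice you would more likely run `jstack <nifi pid>` from cron.
public class PeriodicThreadDumper {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            System.out.println("=== thread dump @ " + Instant.now() + " ===");
            // lockedMonitors=true, lockedSynchronizers=true: include the lock
            // and owner details that reveal read/write lock contention.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                // Note: ThreadInfo.toString() truncates deep stacks to 8 frames;
                // that is enough to spot a pile-up, but use jstack for full traces.
                System.out.print(info);
            }
            Thread.sleep(5 * 60 * 1000L); // 5 minutes between dumps
        }
    }
}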
