----- Original Message ----- > > > ----- Original Message ----- > > From: "Ayal Baron" <aba...@redhat.com> > > To: "Dan Yasny" <dya...@redhat.com> > > Cc: "Simon Grinberg" <si...@redhat.com>, > > vdsm-devel@lists.fedorahosted.org > > Sent: Friday, 10 February, 2012 2:55:46 PM > > Subject: Re: [vdsm] flowID schema > > > > > > > > ----- Original Message ----- > > > > > > > > > ----- Original Message ----- > > > > From: "Ayal Baron" <aba...@redhat.com> > > > > To: "Dan Yasny" <dya...@redhat.com> > > > > Cc: "Simon Grinberg" <si...@redhat.com>, > > > > vdsm-devel@lists.fedorahosted.org > > > > Sent: Friday, 10 February, 2012 12:50:04 AM > > > > Subject: Re: [vdsm] flowID schema > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Saggi Mizrahi" <smizr...@redhat.com> > > > > > > To: "Keith Robertson" <krobe...@redhat.com> > > > > > > Cc: "VDSM Project Development" > > > > > > <vdsm-devel@lists.fedorahosted.org> > > > > > > Sent: Thursday, February 9, 2012 2:24:44 PM > > > > > > Subject: Re: [vdsm] flowID schema > > > > > > > > > > > > -1 > > > > > > > > > > > > I agree that for messaging environment having a Message ID > > > > > > is > > > > > > a > > > > > > must > > > > > > because you sometimes don't have a particular target so > > > > > > when > > > > > > you > > > > > > get > > > > > > a response you need to know what this node is actually > > > > > > responding > > > > > > to. > > > > > > > > > > > > The message ID could be composed with <FLOWID><MSGID> so > > > > > > you > > > > > > can > > > > > > reuse the field. > > > > > > > > > > > > But that is all besides the point. > > > > > > > > > > > > I understand that someone might find it fun to go on > > > > > > following > > > > > > the > > > > > > entire flow in the Engine and in VDSM. But I would like to > > > > > > hear > > > > > > an > > > > > > actual use case where someone would have actually benefited > > > > > > from > > > > > > this. > > > > > > As I see it having VSDM return the task ID with every > > > > > > response > > > > > > (and > > > > > > not just for async tasks) is a lot more useful and correct. > > > > > > > > > > Actually, the only way to understand what happened in a > > > > > certain > > > > > flow > > > > > is to follow it through. From the engine log where an action > > > > > was > > > > > initiated, down to the hosts that did the execution. > > > > > Everything > > > > > RHEV > > > > > does is a flow, and with no correlation between hosts > > > > > executing > > > > > parts of the same flow, troubleshooting turns into guesswork, > > > > > because the only contact point left is time, which is useless > > > > > when > > > > > you're talking about vdsm - there are sometimes hundreds of > > > > > log > > > > > records in a single second, and not every host is in absolute > > > > > sync > > > > > with every other. > > > > > > > > What are you talking about? you know exactly what operation the > > > > engine ran at vdsm level. > > > > > > Not always true. Haven't had much chance to deeply dive into 3.0 > > > logs, and if things changed there, it's already a huge step in > > > the > > > right direction. > > > > engine logs now always log calls to vdsm. > > finally! :) > > > > > > > > > > If it's a task then you also have a task > > > > id which is a uuid so you don't need anything else. > > > > > > Right, but 1. not everything is a task and > > > > Perhaps that should be fixed... > > > > > 2. tasks spawn other tasks, and need to be followed through for > > > that > > > > In vdsm? no they don't > > Err, ok, in the engine then. The point being, a flow like > MoveMultipleImages (pardon my bad memory for spelling) spawns > several tasks, and even within those tasks there's usually more than > a single flow. > > > > > > 3. long running tasks are hell to debug, because they span > > > several > > > log files and thousands lines of logs > > > > flowid would not solve this in any way. > > All you need to do to see the entire task is grep the taskID as the > > thread name is the taskID > > which causes me to > 1. jump between engine and vdsm logs, comparing task IDs and actions. > 2. do it many times in case of a complex flow > > Seeing the entire thing on RHEV/oVirt as a system would make a better > picture, and being able to grep the entire set of logs for a single > ID is a huge plus, since it would produce one large log that > (hopefully, and if everything is done right) shows the full flow, > step by step, everywhere it took place
Which is never what you really want as it is pretty useless. What you want is the point of failure and a few hundred lines before. The rest is just distraction which is the point of this thread. I have never encountered a case where I needed such a flow to debug an issue (even when it spread across multiple hosts). > > > > > > > > > > In addition, now that engine logs results, you can just grep > > > > that > > > > instead of a flow id and land at the exact correct command and > > > > not > > > > have to figure out which out of the 5 run in this flow is the > > > > relevant one. > > > > > > Haven't seen that yet, but again, what are results? When the > > > failure > > > is somewhere in the middle of the flow, the resulting failure can > > > be > > > totally irrelevant. > > > > When you have the failure message then all you have to do is grep > > it > > in the log and reach the exact call that failed. Just search for > > "ERROR|FAILED" above that and you reach the place of error. > > It's the typical approach to logs, that works for other systems. With > RHEV, just finding the traceback is never the solution, because it > shows where everything finally went offhill, instead of where the > error happened. > > > > > > > > > > > > > > If you could give a real example where this would be beneficial > > > > (i.e. > > > > log excerpts, how you correlated them and how flow id would > > > > have > > > > eased your job) that would be great. > > > > > > Don't have these handy, guess Vladik, who was collecting > > > interesting > > > fail flows could have helped. > > > > > > > > > > Note that I've also discussed this with Yaniv from qe who said > > > > they > > > > don't really need it. > > > > > > I'm not saying I want to see a flow ID as such, what I _am_ > > > saying > > > is > > > that flows are important, and we need an easy way of following > > > them > > > through. > > > When a user starts a process in the engine, it should be clearly > > > logged and marked, then it should consistently report progress > > > and > > > interim outputs, say which host was picked for what action, and > > > how > > > that action can be identified in the vdsm logs. > > > > That's fine (although periodic logging causes log overflow) > > Can't those be aggregated somehow? Or sent into a different log? > There was even an idea of maintaining a local db for logging instead > of plaintext, sqlite would serve for that purpose > > > > > > > > > We cannot rely on timing. We cannot rely on everyone knowing > > > obscure > > > engine/vdsm action naming conventions, that are not exactly the > > > same > > > as they look in the GUI. A person with no understanding of engine > > > and vdsm internals, should be able to easily follow an action > > > through to conclusion, and understand what was done at each step, > > > what the system got in return and how it reacted. And all of > > > these > > > actions should be clearly inter-related, so a single grep can > > > select > > > a flow. > > > > > > If you prefer to go in the other direction, like turning > > > everything > > > into a task, that can also work, but each task, when spawning > > > another, should very clearly show the relation as well, and sets > > > of > > > tasks should be marked as sets. > > > > > > > > > > > > > > > > > > > > > > > > A generic debugging scenario as I see it. > > > > > > > > > > > > 1. Something went wrong > > > > > > 2. You go looking in the ENGINE log trying to figure out > > > > > > what > > > > > > happend. > > > > > > 3. You see that ENGINE got SomeError. > > > > > > > > > > ok, the rest are all downhill. > > > > > > > > > > 4. You follow the failure back to the start of the flow, then > > > > > go > > > > > with > > > > > the flow to the point where the engine exited to vdsm > > > > > 5. switch over to vdsm logs, make sure you have the timing > > > > > right > > > > > (with no flow ID that's the olny orientation after all) > > > > > 6. find the start of the vdsm-side flow, follow it to the > > > > > failure, > > > > > pray the error makes sense. > > > > > > > > > > In many cases the answer is not in the vdsm failure traceback > > > > > but > > > > > somewhere in the middle of the flow, with no errors reported, > > > > > this > > > > > is why we need a way to easily follow things through. > > > > > Moreover, > > > > > the > > > > > logs should be readable enough to make sense to a typical > > > > > sysadmin, > > > > > and not a RHEV expert. > > > > > > > > > > > 4. Check to see if this error makes sense imagining that > > > > > > VDSM > > > > > > is > > > > > > always right and is a black box. > > > > > > 5. You did your digging and now you think that VDSM is as > > > > > > fault. > > > > > > 6. Go look for the call that failed. (If we returned the > > > > > > taskID > > > > > > it's > > > > > > pretty simple to find that call). > > > > > > 7. Look around the call to check VDSM state. > > > > > > 8. Profit. > > > > > > > > > > > > There is never a point where you want to follow a whole > > > > > > flow > > > > > > call > > > > > > by > > > > > > call going back and forth, and even if you did having the > > > > > > VDSM > > > > > > taskID is a better anchor then flowID. > > > > > > > > > > not everything is a task, flow IDs would unify entire flows, > > > > > and > > > > > make > > > > > following them easy. > > > > > > > > > > > > > > > > > VDSM is built in a way that every call takes in to account > > > > > > the > > > > > > current state only. Debugging it with an engine flow > > > > > > mindset > > > > > > is > > > > > > just > > > > > > wrong and distracting. I see it doing more harm the good by > > > > > > reinforcing bad debugging practices. > > > > > > > > > > Maybe you're right, though I can't see how from my experience > > > > > so > > > > > far, > > > > > but following the flows is the only thing that got cases > > > > > resolved. > > > > > Not event IDs making every possible error, and not task IDs > > > > > (though > > > > > these do have their uses) - slow and meticulous mapping of > > > > > flows > > > > > to > > > > > log records. > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Keith Robertson" <krobe...@redhat.com> > > > > > > > To: "VDSM Project Development" > > > > > > > <vdsm-devel@lists.fedorahosted.org> > > > > > > > Sent: Thursday, February 9, 2012 1:34:43 PM > > > > > > > Subject: Re: [vdsm] flowID schema > > > > > > > > > > > > > > On 02/09/2012 12:18 PM, Andrew Cathrow wrote: > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > >> From: "Ayal Baron"<aba...@redhat.com> > > > > > > > >> To: "Dan Kenigsberg"<dan...@redhat.com> > > > > > > > >> Cc: "VDSM Project > > > > > > > >> Development"<vdsm-devel@lists.fedorahosted.org> > > > > > > > >> Sent: Monday, February 6, 2012 10:35:54 AM > > > > > > > >> Subject: Re: [vdsm] flowID schema > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> ----- Original Message ----- > > > > > > > >>> On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi > > > > > > > >>> Mizrahi > > > > > > > >>> wrote: > > > > > > > >>>> flowID makes no sense after the initial API call as > > > > > > > >>>> stuff > > > > > > > >>>> like > > > > > > > >>>> cacheing\threadpools\samplingtasks\resources\asyncTasks > > > > > > > >>>> so > > > > > > > >>>> flowing > > > > > > > >>>> a flow like that will not give you the entire > > > > > > > >>>> picture > > > > > > > >>>> while > > > > > > > >>>> debugging. > > > > > > > >>>> > > > > > > > >>>> Also adding it now will make everything even more > > > > > > > >>>> ugly. > > > > > > > >>>> You know what, just imagine I wrote one of my long > > > > > > > >>>> rambles > > > > > > > >>>> about > > > > > > > >>>> why I don't agree with doing this. > > > > > > > >>> I cannot imagine you write anything like that. > > > > > > > >>> Really. > > > > > > > >>> I > > > > > > > >>> do > > > > > > > >>> not > > > > > > > >>> understand why you object logging flowID on API entry > > > > > > > >>> point. > > > > > > > >> The question is, what problem is this really trying to > > > > > > > >> solve > > > > > > > >> and > > > > > > > >> is > > > > > > > >> there a simpler and less obtrusive solution to that > > > > > > > >> problem? > > > > > > > > correlating logs between ovirt engine and potentially > > > > > > > > multiple > > > > > > > > vdsm > > > > > > > > nodes is a nightmare. It requires a lot skill to follow > > > > > > > > a > > > > > > > > transaction through from the front end all the way to > > > > > > > > the > > > > > > > > node, > > > > > > > > and even multiple nodes (eg actions on spm, then > > > > > > > > actions > > > > > > > > on > > > > > > > > other > > > > > > > > node to run a vm). > > > > > > > > Having a way to correlate the logs and follow a single > > > > > > > > event/flow > > > > > > > > is vital. > > > > > > > > > > > > > > > +1 > > > > > > > > > > > > > > Knowing what command caused a sequence of events in VDSM > > > > > > > would > > > > > > > be > > > > > > > really > > > > > > > helpful particularly in a threaded environment. Further, > > > > > > > wouldn't > > > > > > > such > > > > > > > an ID be helpful in an asynchronous request/response > > > > > > > model? > > > > > > > I'm > > > > > > > not > > > > > > > sure what the plans are for AMQP or even if there are > > > > > > > plans, > > > > > > > but > > > > > > > I'd > > > > > > > think that something like this would be crucial for an > > > > > > > async > > > > > > > response. > > > > > > > So, if you implemented it you might be killing 2 birds > > > > > > > with > > > > > > > 1 > > > > > > > stone. > > > > > > > > > > > > > > FYI: If you want to see examples of other systems that > > > > > > > use > > > > > > > similar > > > > > > > concepts, take a look at the correlation ID in JMS. > > > > > > > > > > > > > > Cheers, > > > > > > > Keith > > > > > > > > > > > > > > > > > > > > > >>> _______________________________________________ > > > > > > > >>> vdsm-devel mailing list > > > > > > > >>> vdsm-devel@lists.fedorahosted.org > > > > > > > >>> https://fedorahosted.org/mailman/listinfo/vdsm-devel > > > > > > > >>> > > > > > > > >> _______________________________________________ > > > > > > > >> vdsm-devel mailing list > > > > > > > >> vdsm-devel@lists.fedorahosted.org > > > > > > > >> https://fedorahosted.org/mailman/listinfo/vdsm-devel > > > > > > > >> > > > > > > > > _______________________________________________ > > > > > > > > vdsm-devel mailing list > > > > > > > > vdsm-devel@lists.fedorahosted.org > > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel > > > > > > > > > > > > > > _______________________________________________ > > > > > > > vdsm-devel mailing list > > > > > > > vdsm-devel@lists.fedorahosted.org > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel > > > > > > > > > > > > > _______________________________________________ > > > > > > vdsm-devel mailing list > > > > > > vdsm-devel@lists.fedorahosted.org > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > Dan Yasny > > > > > Red Hat Israel > > > > > +972 9769 2280 > > > > > _______________________________________________ > > > > > vdsm-devel mailing list > > > > > vdsm-devel@lists.fedorahosted.org > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Regards, > > > > > > Dan Yasny > > > Red Hat Israel > > > +972 9769 2280 > > > > > > > -- > > > > Regards, > > Dan Yasny > Red Hat Israel > +972 9769 2280 > _______________________________________________ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel