Re: [vdsm] flowID schema

Ayal Baron Fri, 10 Feb 2012 05:51:16 -0800


----- Original Message -----
> 
> 
> ----- Original Message -----
> > From: "Ayal Baron" <aba...@redhat.com>
> > To: "Dan Yasny" <dya...@redhat.com>
> > Cc: "Simon Grinberg" <si...@redhat.com>,
> > vdsm-devel@lists.fedorahosted.org
> > Sent: Friday, 10 February, 2012 2:55:46 PM
> > Subject: Re: [vdsm] flowID schema
> > 
> > 
> > 
> > ----- Original Message -----
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Ayal Baron" <aba...@redhat.com>
> > > > To: "Dan Yasny" <dya...@redhat.com>
> > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > vdsm-devel@lists.fedorahosted.org
> > > > Sent: Friday, 10 February, 2012 12:50:04 AM
> > > > Subject: Re: [vdsm] flowID schema
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > > From: "Saggi Mizrahi" <smizr...@redhat.com>
> > > > > > To: "Keith Robertson" <krobe...@redhat.com>
> > > > > > Cc: "VDSM Project Development"
> > > > > > <vdsm-devel@lists.fedorahosted.org>
> > > > > > Sent: Thursday, February 9, 2012 2:24:44 PM
> > > > > > Subject: Re: [vdsm] flowID schema
> > > > > >
> > > > > > -1
> > > > > >
> > > > > > I agree that for messaging environment having a Message ID
> > > > > > is
> > > > > > a
> > > > > > must
> > > > > > because you sometimes don't have a particular target so
> > > > > > when
> > > > > > you
> > > > > > get
> > > > > > a response you need to know what this node is actually
> > > > > > responding
> > > > > > to.
> > > > > >
> > > > > > The message ID could be composed with <FLOWID><MSGID> so
> > > > > > you
> > > > > > can
> > > > > > reuse the field.
> > > > > >
> > > > > > But that is all besides the point.
> > > > > >
> > > > > > I understand that someone might find it fun to go on
> > > > > > following
> > > > > > the
> > > > > > entire flow in the Engine and in VDSM. But I would like to
> > > > > > hear
> > > > > > an
> > > > > > actual use case where someone would have actually benefited
> > > > > > from
> > > > > > this.
> > > > > > As I see it having VSDM return the task ID with every
> > > > > > response
> > > > > > (and
> > > > > > not just for async tasks) is a lot more useful and correct.
> > > > > 
> > > > > Actually, the only way to understand what happened in a
> > > > > certain
> > > > > flow
> > > > > is to follow it through. From the engine log where an action
> > > > > was
> > > > > initiated, down to the hosts that did the execution.
> > > > > Everything
> > > > > RHEV
> > > > > does is a flow, and with no correlation between hosts
> > > > > executing
> > > > > parts of the same flow, troubleshooting turns into guesswork,
> > > > > because the only contact point left is time, which is useless
> > > > > when
> > > > > you're talking about vdsm - there are sometimes hundreds of
> > > > > log
> > > > > records in a single second, and not every host is in absolute
> > > > > sync
> > > > > with every other.
> > > > 
> > > > What are you talking about? you know exactly what operation the
> > > > engine ran at vdsm level.
> > > 
> > > Not always true. Haven't had much chance to deeply dive into 3.0
> > > logs, and if things changed there, it's already a huge step in
> > > the
> > > right direction.
> > 
> > engine logs now always log calls to vdsm.
> 
> finally! :)
> 
> > 
> > > 
> > > > If it's a task then you also have a task
> > > > id which is a uuid so you don't need anything else.
> > > 
> > > Right, but 1. not everything is a task and
> > 
> > Perhaps that should be fixed...
> > 
> > > 2. tasks spawn other tasks, and need to be followed through for
> > > that
> > 
> > In vdsm? no they don't
> 
> Err, ok, in the engine then. The point being, a flow like
> MoveMultipleImages (pardon my bad memory for spelling) spawns
> several tasks, and even within those tasks there's usually more than
> a single flow.
> 
> > 
> > > 3. long running tasks are hell to debug, because they span
> > > several
> > > log files and thousands lines of logs
> > 
> > flowid would not solve this in any way.
> > All you need to do to see the entire task is grep the taskID as the
> > thread name is the taskID
> 
> which causes me to
> 1. jump between engine and vdsm logs, comparing task IDs and actions.
> 2. do it many times in case of a complex flow
> 
> Seeing the entire thing on RHEV/oVirt as a system would make a better
> picture, and being able to grep the entire set of logs for a single
> ID is a huge plus, since it would produce one large log that
> (hopefully, and if everything is done right) shows the full flow,
> step by step, everywhere it took place


Which is never what you really want as it is pretty useless.  What you want is 
the point of failure and a few hundred lines before.  The rest is just 
distraction which is the point of this thread.  I have never encountered a case 
where I needed such a flow to debug an issue (even when it spread across 
multiple hosts).

> 
> > 
> > > 
> > > > In addition, now that engine logs results, you can just grep
> > > > that
> > > > instead of a flow id and land at the exact correct command and
> > > > not
> > > > have to figure out which out of the 5 run in this flow is the
> > > > relevant one.
> > > 
> > > Haven't seen that yet, but again, what are results? When the
> > > failure
> > > is somewhere in the middle of the flow, the resulting failure can
> > > be
> > > totally irrelevant.
> > 
> > When you have the failure message then all you have to do is grep
> > it
> > in the log and reach the exact call that failed.  Just search for
> > "ERROR|FAILED" above that and you reach the place of error.
> 
> It's the typical approach to logs, that works for other systems. With
> RHEV, just finding the traceback is never the solution, because it
> shows where everything finally went offhill, instead of where the
> error happened.
> 
> > 
> > > 
> > > > 
> > > > If you could give a real example where this would be beneficial
> > > > (i.e.
> > > > log excerpts, how you correlated them and how flow id would
> > > > have
> > > > eased your job) that would be great.
> > > 
> > > Don't have these handy, guess Vladik, who was collecting
> > > interesting
> > > fail flows could have helped.
> > > 
> > > 
> > > > Note that I've also discussed this with Yaniv from qe who said
> > > > they
> > > > don't really need it.
> > > 
> > > I'm not saying I want to see a flow ID as such, what I _am_
> > > saying
> > > is
> > > that flows are important, and we need an easy way of following
> > > them
> > > through.
> > > When a user starts a process in the engine, it should be clearly
> > > logged and marked, then it should consistently report progress
> > > and
> > > interim outputs, say which host was picked for what action, and
> > > how
> > > that action can be identified in the vdsm logs.
> > 
> > That's fine (although periodic logging causes log overflow)
> 
> Can't those be aggregated somehow? Or sent into a different log?
> There was even an idea of maintaining a local db for logging instead
> of plaintext, sqlite would serve for that purpose
> 
> > 
> > > 
> > > We cannot rely on timing. We cannot rely on everyone knowing
> > > obscure
> > > engine/vdsm action naming conventions, that are not exactly the
> > > same
> > > as they look in the GUI. A person with no understanding of engine
> > > and vdsm internals, should be able to easily follow an action
> > > through to conclusion, and understand what was done at each step,
> > > what the system got in return and how it reacted. And all of
> > > these
> > > actions should be clearly inter-related, so a single grep can
> > > select
> > > a flow.
> > > 
> > > If you prefer to go in the other direction, like turning
> > > everything
> > > into a task, that can also work, but each task, when spawning
> > > another, should very clearly show the relation as well, and sets
> > > of
> > > tasks should be marked as sets.
> > > 
> > > > 
> > > > > 
> > > > > >
> > > > > > A generic debugging scenario as I see it.
> > > > > >
> > > > > > 1. Something went wrong
> > > > > > 2. You go looking in the ENGINE log trying to figure out
> > > > > > what
> > > > > > happend.
> > > > > > 3. You see that ENGINE got SomeError.
> > > > > 
> > > > > ok, the rest are all downhill.
> > > > > 
> > > > > 4. You follow the failure back to the start of the flow, then
> > > > > go
> > > > > with
> > > > > the flow to the point where the engine exited to vdsm
> > > > > 5. switch over to vdsm logs, make sure you have the timing
> > > > > right
> > > > > (with no flow ID that's the olny orientation after all)
> > > > > 6. find the start of the vdsm-side flow, follow it to the
> > > > > failure,
> > > > > pray the error makes sense.
> > > > > 
> > > > > In many cases the answer is not in the vdsm failure traceback
> > > > > but
> > > > > somewhere in the middle of the flow, with no errors reported,
> > > > > this
> > > > > is why we need a way to easily follow things through.
> > > > > Moreover,
> > > > > the
> > > > > logs should be readable enough to make sense to a typical
> > > > > sysadmin,
> > > > > and not a RHEV expert.
> > > > > 
> > > > > > 4. Check to see if this error makes sense imagining that
> > > > > > VDSM
> > > > > > is
> > > > > > always right and is a black box.
> > > > > > 5. You did your digging and now you think that VDSM is as
> > > > > > fault.
> > > > > > 6. Go look for the call that failed. (If we returned the
> > > > > > taskID
> > > > > > it's
> > > > > > pretty simple to find that call).
> > > > > > 7. Look around the call to check VDSM state.
> > > > > > 8. Profit.
> > > > > >
> > > > > > There is never a point where you want to follow a whole
> > > > > > flow
> > > > > > call
> > > > > > by
> > > > > > call going back and forth, and even if you did having the
> > > > > > VDSM
> > > > > > taskID is a better anchor then flowID.
> > > > > 
> > > > > not everything is a task, flow IDs would unify entire flows,
> > > > > and
> > > > > make
> > > > > following them easy.
> > > > > 
> > > > > >
> > > > > > VDSM is built in a way that every call takes in to account
> > > > > > the
> > > > > > current state only. Debugging it with an engine flow
> > > > > > mindset
> > > > > > is
> > > > > > just
> > > > > > wrong and distracting. I see it doing more harm the good by
> > > > > > reinforcing bad debugging practices.
> > > > > 
> > > > > Maybe you're right, though I can't see how from my experience
> > > > > so
> > > > > far,
> > > > > but following the flows is the only thing that got cases
> > > > > resolved.
> > > > > Not event IDs making every possible error, and not task IDs
> > > > > (though
> > > > > these do have their uses) - slow and meticulous mapping of
> > > > > flows
> > > > > to
> > > > > log records.
> > > > > 
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Keith Robertson" <krobe...@redhat.com>
> > > > > > > To: "VDSM Project Development"
> > > > > > > <vdsm-devel@lists.fedorahosted.org>
> > > > > > > Sent: Thursday, February 9, 2012 1:34:43 PM
> > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > >
> > > > > > > On 02/09/2012 12:18 PM, Andrew Cathrow wrote:
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > >> From: "Ayal Baron"<aba...@redhat.com>
> > > > > > > >> To: "Dan Kenigsberg"<dan...@redhat.com>
> > > > > > > >> Cc: "VDSM Project
> > > > > > > >> Development"<vdsm-devel@lists.fedorahosted.org>
> > > > > > > >> Sent: Monday, February 6, 2012 10:35:54 AM
> > > > > > > >> Subject: Re: [vdsm] flowID schema
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> ----- Original Message -----
> > > > > > > >>> On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi
> > > > > > > >>> Mizrahi
> > > > > > > >>> wrote:
> > > > > > > >>>> flowID makes no sense after the initial API call as
> > > > > > > >>>> stuff
> > > > > > > >>>> like
> > > > > > > >>>> cacheing\threadpools\samplingtasks\resources\asyncTasks
> > > > > > > >>>> so
> > > > > > > >>>> flowing
> > > > > > > >>>> a flow like that will not give you the entire
> > > > > > > >>>> picture
> > > > > > > >>>> while
> > > > > > > >>>> debugging.
> > > > > > > >>>>
> > > > > > > >>>> Also adding it now will make everything even more
> > > > > > > >>>> ugly.
> > > > > > > >>>> You know what, just imagine I wrote one of my long
> > > > > > > >>>> rambles
> > > > > > > >>>> about
> > > > > > > >>>> why I don't agree with doing this.
> > > > > > > >>> I cannot imagine you write anything like that.
> > > > > > > >>> Really.
> > > > > > > >>> I
> > > > > > > >>> do
> > > > > > > >>> not
> > > > > > > >>> understand why you object logging flowID on API entry
> > > > > > > >>> point.
> > > > > > > >> The question is, what problem is this really trying to
> > > > > > > >> solve
> > > > > > > >> and
> > > > > > > >> is
> > > > > > > >> there a simpler and less obtrusive solution to that
> > > > > > > >> problem?
> > > > > > > > correlating logs between ovirt engine and potentially
> > > > > > > > multiple
> > > > > > > > vdsm
> > > > > > > > nodes is a nightmare. It requires a lot skill to follow
> > > > > > > > a
> > > > > > > > transaction through from the front end all the way to
> > > > > > > > the
> > > > > > > > node,
> > > > > > > > and even multiple nodes (eg actions on spm, then
> > > > > > > > actions
> > > > > > > > on
> > > > > > > > other
> > > > > > > > node to run a vm).
> > > > > > > > Having a way to correlate the logs and follow a single
> > > > > > > > event/flow
> > > > > > > > is vital.
> > > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Knowing what command caused a sequence of events in VDSM
> > > > > > > would
> > > > > > > be
> > > > > > > really
> > > > > > > helpful particularly in a threaded environment.  Further,
> > > > > > > wouldn't
> > > > > > > such
> > > > > > > an ID be helpful in an asynchronous request/response
> > > > > > > model?
> > > > > > >  I'm
> > > > > > > not
> > > > > > > sure what the plans are for AMQP or even if there are
> > > > > > > plans,
> > > > > > > but
> > > > > > > I'd
> > > > > > > think that something like this would be crucial for an
> > > > > > > async
> > > > > > > response.
> > > > > > > So, if you implemented it you might be killing 2 birds
> > > > > > > with
> > > > > > > 1
> > > > > > > stone.
> > > > > > >
> > > > > > > FYI: If you want to see examples of other systems that
> > > > > > > use
> > > > > > > similar
> > > > > > > concepts, take a look at the correlation ID in JMS.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Keith
> > > > > > >
> > > > > > >
> > > > > > > >>> _______________________________________________
> > > > > > > >>> vdsm-devel mailing list
> > > > > > > >>> vdsm-devel@lists.fedorahosted.org
> > > > > > > >>> https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > >>>
> > > > > > > >> _______________________________________________
> > > > > > > >> vdsm-devel mailing list
> > > > > > > >> vdsm-devel@lists.fedorahosted.org
> > > > > > > >> https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > >>
> > > > > > > > _______________________________________________
> > > > > > > > vdsm-devel mailing list
> > > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > vdsm-devel mailing list
> > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > >
> > > > > > _______________________________________________
> > > > > > vdsm-devel mailing list
> > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > >
> > > > > 
> > > > > 
> > > > > --
> > > > > 
> > > > > 
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Dan Yasny
> > > > > Red Hat Israel
> > > > > +972 9769 2280
> > > > > _______________________________________________
> > > > > vdsm-devel mailing list
> > > > > vdsm-devel@lists.fedorahosted.org
> > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > 
> > > > 
> > > 
> > > --
> > > 
> > > 
> > > 
> > > Regards,
> > > 
> > > Dan Yasny
> > > Red Hat Israel
> > > +972 9769 2280
> > > 
> > 
> 
> --
> 
> 
> 
> Regards,
> 
> Dan Yasny
> Red Hat Israel
> +972 9769 2280
> 
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/vdsm-devel

Re: [vdsm] flowID schema

Reply via email to