Hi Ayal,

 > > actual use case where someone would have actually benefited from
 > > this.


Sorry for joining the discussion so late.

Forget the automatic analysis and recognition of past issues that may or may 
not be possible once we have a flow ID. (Personally, I believe it is possible 
for known issues.)

Think of large setups, not the few-host setups that run in the labs:
dozens of hosts, hundreds of VMs and tasks, running for a few months. 
The result is huge logs (yes, we have even received logs 50G in size).

Now try to do what you and Saggi are suggesting. Yes, it will work! 
Sometimes it is the only approach that will be possible.

But what about efficiency?

Sometimes I found myself spending 30 minutes just getting to the files that I 
needed to view and to the approximate location of the error. 
If you have a flow ID you can have a tool that will extract all the relevant 
information from all the hosts participating in that flow and correlate that 
with the rhevm.log as the anchor.
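
To make the idea concrete, here is a minimal sketch of such an extraction 
tool. It is not an existing utility: the "flowID=<id>" marker, the timestamp 
layout and the names extract_flow and parse_ts are all assumptions made for 
illustration. It collects every line carrying a given flow ID from rhevm.log 
and from each host's vdsm log, then merges them into one timeline.

```python
import re
from datetime import datetime

# Assumed (hypothetical) line layout: "2012-02-10 16:01:01,123 ... flowID=<id> ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def extract_flow(logs, flow_id):
    """Collect the lines of one flow from several logs.

    logs maps a source name (e.g. "rhevm.log", "host1/vdsm.log") to an
    iterable of lines. Returns (timestamp, source, line) tuples sorted by
    timestamp, with the source name as a tie-breaker.
    """
    # The lookahead prevents "abc1" from matching inside "abc12".
    marker = re.compile(r"flowID=" + re.escape(flow_id) + r"(?![\w-])")
    records = []
    for source, lines in logs.items():
        for line in lines:
            if marker.search(line) is None:
                continue
            ts = parse_ts(line)
            if ts is None:
                continue  # e.g. traceback continuation lines without a timestamp
            records.append((ts, source, line.rstrip("\n")))
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return records
```

Host clocks are rarely in perfect sync, so the merged order is only 
approximate across hosts; the flow ID is what ties the records together, and 
the rhevm.log entries serve as the anchor.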

(Vladik actually wrote a tool that follows the SPM and creates an SPM log no 
matter which host it resides on. This allowed tracking flows and proved 
useful and time saving, especially since the SPM tends to move around when 
storage issues begin :))

In addition to that, think about skill levels. 
For you it's very easy to go and debug: you wrote the code, you know what to 
expect and what the normal behavior is (most of the time). But what about 
others?
When front-line support or a customer doing self-debugging encounters an issue 
within a flow, the easiest way is to compare it to the same flow that 
succeeded (if one exists) and understand what was supposed to happen next (or 
before) and did not. I actually used that technique a lot. Again, a flow ID 
will save time. All you have to do is extract an identical flow that worked 
using the flow-extractor utility and then compare.
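
The comparison step could look roughly like the sketch below. It assumes 
the two flows have already been extracted into lists of lines, and that 
timestamps, UUIDs and the "flowID=<id>" marker are the only parts that differ 
between two healthy runs. Those patterns and the names normalize and 
diff_flows are illustrative assumptions, not an existing tool.

```python
import difflib
import re

# Assumed (hypothetical) variable parts of a log line.
TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s*")
UUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
FLOW = re.compile(r"flowID=\S+")


def normalize(line):
    """Strip the parts of a line that legitimately differ between runs."""
    line = TS.sub("", line.rstrip("\n"))
    line = UUID.sub("<uuid>", line)
    return FLOW.sub("flowID=<id>", line)


def diff_flows(good, bad):
    """Return a unified diff between a flow that worked and one that failed."""
    a = [normalize(line) for line in good]
    b = [normalize(line) for line in bad]
    return list(
        difflib.unified_diff(a, b, fromfile="good", tofile="bad", lineterm="")
    )
```

Lines prefixed with "-" in the result are steps that happened in the working 
flow but are missing from the failing one, which is exactly the "what was 
supposed to happen next and did not" question.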


I agree that when you get to the really tough cases, they will span 
flows and your only alternative may be just to dig in. However, for many 
others, flow tracking will save support calls and support escalations, and 
will considerably help with education. For 2.2 we had Marina going into the 
code and documenting flows, so support people could learn what to expect and 
what to look for in the logs.

I believe it will also help engineering. Lots of times I had to gather 3 
engineers and sit together just to understand a flow in order to debug an 
issue or understand what to expect when someone complained about weird 
behavior. Since each of them only knew what his part was doing, all had to go 
into their code and discuss. With a flow ID you can just extract the flow and 
easily see what is being done (you may find out, as has happened before, that 
it's not exactly what they meant to do). 

In short, if we aim to look at RHEV as a system, flows are helpful in many 
respects. Instead of having discrete components that require a high level of 
expertise and a lot of time (finding the right spot) to debug even minor 
issues, you can look at the system as a whole, easily see (most of the time) 
what to expect, and get to the most relevant logs fairly fast.

Simon.










----- Original Message -----
> From: "Dan Yasny" <dya...@redhat.com>
> To: "Ayal Baron" <aba...@redhat.com>
> Cc: "Simon Grinberg" <si...@redhat.com>, vdsm-devel@lists.fedorahosted.org
> Sent: Friday, February 10, 2012 4:01:01 PM
> Subject: Re: [vdsm] flowID schema
> 
> 
> 
> ----- Original Message -----
> > From: "Ayal Baron" <aba...@redhat.com>
> > To: "Dan Yasny" <dya...@redhat.com>
> > Cc: "Simon Grinberg" <si...@redhat.com>,
> > vdsm-devel@lists.fedorahosted.org
> > Sent: Friday, 10 February, 2012 3:51:01 PM
> > Subject: Re: [vdsm] flowID schema
> > 
> > 
> > 
> > ----- Original Message -----
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Ayal Baron" <aba...@redhat.com>
> > > > To: "Dan Yasny" <dya...@redhat.com>
> > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > vdsm-devel@lists.fedorahosted.org
> > > > Sent: Friday, 10 February, 2012 2:55:46 PM
> > > > Subject: Re: [vdsm] flowID schema
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > 
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Ayal Baron" <aba...@redhat.com>
> > > > > > To: "Dan Yasny" <dya...@redhat.com>
> > > > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > Sent: Friday, 10 February, 2012 12:50:04 AM
> > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > > > From: "Saggi Mizrahi" <smizr...@redhat.com>
> > > > > > > > To: "Keith Robertson" <krobe...@redhat.com>
> > > > > > > > Cc: "VDSM Project Development"
> > > > > > > > <vdsm-devel@lists.fedorahosted.org>
> > > > > > > > Sent: Thursday, February 9, 2012 2:24:44 PM
> > > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > > >
> > > > > > > > -1
> > > > > > > >
> > > > > > > > I agree that for messaging environment having a Message
> > > > > > > > ID
> > > > > > > > is
> > > > > > > > a
> > > > > > > > must
> > > > > > > > because you sometimes don't have a particular target so
> > > > > > > > when
> > > > > > > > you
> > > > > > > > get
> > > > > > > > a response you need to know what this node is actually
> > > > > > > > responding
> > > > > > > > to.
> > > > > > > >
> > > > > > > > The message ID could be composed with <FLOWID><MSGID>
> > > > > > > > so
> > > > > > > > you
> > > > > > > > can
> > > > > > > > reuse the field.
> > > > > > > >
> > > > > > > > But that is all besides the point.
> > > > > > > >
> > > > > > > > I understand that someone might find it fun to go on
> > > > > > > > following
> > > > > > > > the
> > > > > > > > entire flow in the Engine and in VDSM. But I would like
> > > > > > > > to
> > > > > > > > hear
> > > > > > > > an
> > > > > > > > actual use case where someone would have actually
> > > > > > > > benefited
> > > > > > > > from
> > > > > > > > this.
> > > > > > > > As I see it having VSDM return the task ID with every
> > > > > > > > response
> > > > > > > > (and
> > > > > > > > not just for async tasks) is a lot more useful and
> > > > > > > > correct.
> > > > > > > 
> > > > > > > Actually, the only way to understand what happened in a
> > > > > > > certain
> > > > > > > flow
> > > > > > > is to follow it through. From the engine log where an
> > > > > > > action
> > > > > > > was
> > > > > > > initiated, down to the hosts that did the execution.
> > > > > > > Everything
> > > > > > > RHEV
> > > > > > > does is a flow, and with no correlation between hosts
> > > > > > > executing
> > > > > > > parts of the same flow, troubleshooting turns into
> > > > > > > guesswork,
> > > > > > > because the only contact point left is time, which is
> > > > > > > useless
> > > > > > > when
> > > > > > > you're talking about vdsm - there are sometimes hundreds
> > > > > > > of
> > > > > > > log
> > > > > > > records in a single second, and not every host is in
> > > > > > > absolute
> > > > > > > sync
> > > > > > > with every other.
> > > > > > 
> > > > > > What are you talking about? you know exactly what operation
> > > > > > the
> > > > > > engine ran at vdsm level.
> > > > > 
> > > > > Not always true. Haven't had much chance to deeply dive into
> > > > > 3.0
> > > > > logs, and if things changed there, it's already a huge step
> > > > > in
> > > > > the
> > > > > right direction.
> > > > 
> > > > engine logs now always log calls to vdsm.
> > > 
> > > finally! :)
> > > 
> > > > 
> > > > > 
> > > > > > If it's a task then you also have a task
> > > > > > id which is a uuid so you don't need anything else.
> > > > > 
> > > > > Right, but 1. not everything is a task and
> > > > 
> > > > Perhaps that should be fixed...
> > > > 
> > > > > 2. tasks spawn other tasks, and need to be followed through
> > > > > for
> > > > > that
> > > > 
> > > > In vdsm? no they don't
> > > 
> > > Err, ok, in the engine then. The point being, a flow like
> > > MoveMultipleImages (pardon my bad memory for spelling) spawns
> > > several tasks, and even within those tasks there's usually more
> > > than
> > > a single flow.
> > > 
> > > > 
> > > > > 3. long running tasks are hell to debug, because they span
> > > > > several
> > > > > log files and thousands lines of logs
> > > > 
> > > > flowid would not solve this in any way.
> > > > All you need to do to see the entire task is grep the taskID as
> > > > the
> > > > thread name is the taskID
> > > 
> > > which causes me to
> > > 1. jump between engine and vdsm logs, comparing task IDs and
> > > actions.
> > > 2. do it many times in case of a complex flow
> > > 
> > > Seeing the entire thing on RHEV/oVirt as a system would make a
> > > better
> > > picture, and being able to grep the entire set of logs for a
> > > single
> > > ID is a huge plus, since it would produce one large log that
> > > (hopefully, and if everything is done right) shows the full flow,
> > > step by step, everywhere it took place
> > 
> > Which is never what you really want as it is pretty useless.  What
> > you want is the point of failure and a few hundred lines before.
> 
> The problem is - the point of failure is not always the point with
> the python traceback. This is why a flow should be follow-able
> 
> >  The rest is just distraction which is the point of this thread.  I
> > have never encountered a case where I needed such a flow to debug
> > an
> > issue (even when it spread across multiple hosts).
> 
> No, the rest is an indication of flow progress, which can help with
> troubleshooting of issues that don't end in an error. Like
> performance and scalability issues
> 
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > In addition, now that engine logs results, you can just
> > > > > > grep
> > > > > > that
> > > > > > instead of a flow id and land at the exact correct command
> > > > > > and
> > > > > > not
> > > > > > have to figure out which out of the 5 run in this flow is
> > > > > > the
> > > > > > relevant one.
> > > > > 
> > > > > Haven't seen that yet, but again, what are results? When the
> > > > > failure
> > > > > is somewhere in the middle of the flow, the resulting failure
> > > > > can
> > > > > be
> > > > > totally irrelevant.
> > > > 
> > > > When you have the failure message then all you have to do is
> > > > grep
> > > > it
> > > > in the log and reach the exact call that failed.  Just search
> > > > for
> > > > "ERROR|FAILED" above that and you reach the place of error.
> > > 
> > > It's the typical approach to logs, that works for other systems.
> > > With
> > > RHEV, just finding the traceback is never the solution, because
> > > it
> > > shows where everything finally went offhill, instead of where the
> > > error happened.
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > If you could give a real example where this would be
> > > > > > beneficial
> > > > > > (i.e.
> > > > > > log excerpts, how you correlated them and how flow id would
> > > > > > have
> > > > > > eased your job) that would be great.
> > > > > 
> > > > > Don't have these handy, guess Vladik, who was collecting
> > > > > interesting
> > > > > fail flows could have helped.
> > > > > 
> > > > > 
> > > > > > Note that I've also discussed this with Yaniv from qe who
> > > > > > said
> > > > > > they
> > > > > > don't really need it.
> > > > > 
> > > > > I'm not saying I want to see a flow ID as such, what I _am_
> > > > > saying
> > > > > is
> > > > > that flows are important, and we need an easy way of
> > > > > following
> > > > > them
> > > > > through.
> > > > > When a user starts a process in the engine, it should be
> > > > > clearly
> > > > > logged and marked, then it should consistently report
> > > > > progress
> > > > > and
> > > > > interim outputs, say which host was picked for what action,
> > > > > and
> > > > > how
> > > > > that action can be identified in the vdsm logs.
> > > > 
> > > > That's fine (although periodic logging causes log overflow)
> > > 
> > > Can't those be aggregated somehow? Or sent into a different log?
> > > There was even an idea of maintaining a local db for logging
> > > instead
> > > of plaintext, sqlite would serve for that purpose
> > > 
> > > > 
> > > > > 
> > > > > We cannot rely on timing. We cannot rely on everyone knowing
> > > > > obscure
> > > > > engine/vdsm action naming conventions, that are not exactly
> > > > > the
> > > > > same
> > > > > as they look in the GUI. A person with no understanding of
> > > > > engine
> > > > > and vdsm internals, should be able to easily follow an action
> > > > > through to conclusion, and understand what was done at each
> > > > > step,
> > > > > what the system got in return and how it reacted. And all of
> > > > > these
> > > > > actions should be clearly inter-related, so a single grep can
> > > > > select
> > > > > a flow.
> > > > > 
> > > > > If you prefer to go in the other direction, like turning
> > > > > everything
> > > > > into a task, that can also work, but each task, when spawning
> > > > > another, should very clearly show the relation as well, and
> > > > > sets
> > > > > of
> > > > > tasks should be marked as sets.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > >
> > > > > > > > A generic debugging scenario as I see it.
> > > > > > > >
> > > > > > > > 1. Something went wrong
> > > > > > > > 2. You go looking in the ENGINE log trying to figure
> > > > > > > > out
> > > > > > > > what
> > > > > > > > happend.
> > > > > > > > 3. You see that ENGINE got SomeError.
> > > > > > > 
> > > > > > > ok, the rest are all downhill.
> > > > > > > 
> > > > > > > 4. You follow the failure back to the start of the flow,
> > > > > > > then
> > > > > > > go
> > > > > > > with
> > > > > > > the flow to the point where the engine exited to vdsm
> > > > > > > 5. switch over to vdsm logs, make sure you have the
> > > > > > > timing
> > > > > > > right
> > > > > > > (with no flow ID that's the olny orientation after all)
> > > > > > > 6. find the start of the vdsm-side flow, follow it to the
> > > > > > > failure,
> > > > > > > pray the error makes sense.
> > > > > > > 
> > > > > > > In many cases the answer is not in the vdsm failure
> > > > > > > traceback
> > > > > > > but
> > > > > > > somewhere in the middle of the flow, with no errors
> > > > > > > reported,
> > > > > > > this
> > > > > > > is why we need a way to easily follow things through.
> > > > > > > Moreover,
> > > > > > > the
> > > > > > > logs should be readable enough to make sense to a typical
> > > > > > > sysadmin,
> > > > > > > and not a RHEV expert.
> > > > > > > 
> > > > > > > > 4. Check to see if this error makes sense imagining
> > > > > > > > that
> > > > > > > > VDSM
> > > > > > > > is
> > > > > > > > always right and is a black box.
> > > > > > > > 5. You did your digging and now you think that VDSM is
> > > > > > > > as
> > > > > > > > fault.
> > > > > > > > 6. Go look for the call that failed. (If we returned
> > > > > > > > the
> > > > > > > > taskID
> > > > > > > > it's
> > > > > > > > pretty simple to find that call).
> > > > > > > > 7. Look around the call to check VDSM state.
> > > > > > > > 8. Profit.
> > > > > > > >
> > > > > > > > There is never a point where you want to follow a whole
> > > > > > > > flow
> > > > > > > > call
> > > > > > > > by
> > > > > > > > call going back and forth, and even if you did having
> > > > > > > > the
> > > > > > > > VDSM
> > > > > > > > taskID is a better anchor then flowID.
> > > > > > > 
> > > > > > > not everything is a task, flow IDs would unify entire
> > > > > > > flows,
> > > > > > > and
> > > > > > > make
> > > > > > > following them easy.
> > > > > > > 
> > > > > > > >
> > > > > > > > VDSM is built in a way that every call takes in to
> > > > > > > > account
> > > > > > > > the
> > > > > > > > current state only. Debugging it with an engine flow
> > > > > > > > mindset
> > > > > > > > is
> > > > > > > > just
> > > > > > > > wrong and distracting. I see it doing more harm the
> > > > > > > > good
> > > > > > > > by
> > > > > > > > reinforcing bad debugging practices.
> > > > > > > 
> > > > > > > Maybe you're right, though I can't see how from my
> > > > > > > experience
> > > > > > > so
> > > > > > > far,
> > > > > > > but following the flows is the only thing that got cases
> > > > > > > resolved.
> > > > > > > Not event IDs making every possible error, and not task
> > > > > > > IDs
> > > > > > > (though
> > > > > > > these do have their uses) - slow and meticulous mapping
> > > > > > > of
> > > > > > > flows
> > > > > > > to
> > > > > > > log records.
> > > > > > > 
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Keith Robertson" <krobe...@redhat.com>
> > > > > > > > > To: "VDSM Project Development"
> > > > > > > > > <vdsm-devel@lists.fedorahosted.org>
> > > > > > > > > Sent: Thursday, February 9, 2012 1:34:43 PM
> > > > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > > > >
> > > > > > > > > On 02/09/2012 12:18 PM, Andrew Cathrow wrote:
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > >> From: "Ayal Baron"<aba...@redhat.com>
> > > > > > > > > >> To: "Dan Kenigsberg"<dan...@redhat.com>
> > > > > > > > > >> Cc: "VDSM Project
> > > > > > > > > >> Development"<vdsm-devel@lists.fedorahosted.org>
> > > > > > > > > >> Sent: Monday, February 6, 2012 10:35:54 AM
> > > > > > > > > >> Subject: Re: [vdsm] flowID schema
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> ----- Original Message -----
> > > > > > > > > >>> On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi
> > > > > > > > > >>> Mizrahi
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>> flowID makes no sense after the initial API call
> > > > > > > > > >>>> as
> > > > > > > > > >>>> stuff
> > > > > > > > > >>>> like
> > > > > > > > > >>>> cacheing\threadpools\samplingtasks\resources\asyncTasks
> > > > > > > > > >>>> so
> > > > > > > > > >>>> flowing
> > > > > > > > > >>>> a flow like that will not give you the entire
> > > > > > > > > >>>> picture
> > > > > > > > > >>>> while
> > > > > > > > > >>>> debugging.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Also adding it now will make everything even
> > > > > > > > > >>>> more
> > > > > > > > > >>>> ugly.
> > > > > > > > > >>>> You know what, just imagine I wrote one of my
> > > > > > > > > >>>> long
> > > > > > > > > >>>> rambles
> > > > > > > > > >>>> about
> > > > > > > > > >>>> why I don't agree with doing this.
> > > > > > > > > >>> I cannot imagine you write anything like that.
> > > > > > > > > >>> Really.
> > > > > > > > > >>> I
> > > > > > > > > >>> do
> > > > > > > > > >>> not
> > > > > > > > > >>> understand why you object logging flowID on API
> > > > > > > > > >>> entry
> > > > > > > > > >>> point.
> > > > > > > > > >> The question is, what problem is this really
> > > > > > > > > >> trying
> > > > > > > > > >> to
> > > > > > > > > >> solve
> > > > > > > > > >> and
> > > > > > > > > >> is
> > > > > > > > > >> there a simpler and less obtrusive solution to
> > > > > > > > > >> that
> > > > > > > > > >> problem?
> > > > > > > > > > correlating logs between ovirt engine and
> > > > > > > > > > potentially
> > > > > > > > > > multiple
> > > > > > > > > > vdsm
> > > > > > > > > > nodes is a nightmare. It requires a lot skill to
> > > > > > > > > > follow
> > > > > > > > > > a
> > > > > > > > > > transaction through from the front end all the way
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > node,
> > > > > > > > > > and even multiple nodes (eg actions on spm, then
> > > > > > > > > > actions
> > > > > > > > > > on
> > > > > > > > > > other
> > > > > > > > > > node to run a vm).
> > > > > > > > > > Having a way to correlate the logs and follow a
> > > > > > > > > > single
> > > > > > > > > > event/flow
> > > > > > > > > > is vital.
> > > > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > Knowing what command caused a sequence of events in
> > > > > > > > > VDSM
> > > > > > > > > would
> > > > > > > > > be
> > > > > > > > > really
> > > > > > > > > helpful particularly in a threaded environment.
> > > > > > > > >  Further,
> > > > > > > > > wouldn't
> > > > > > > > > such
> > > > > > > > > an ID be helpful in an asynchronous request/response
> > > > > > > > > model?
> > > > > > > > >  I'm
> > > > > > > > > not
> > > > > > > > > sure what the plans are for AMQP or even if there are
> > > > > > > > > plans,
> > > > > > > > > but
> > > > > > > > > I'd
> > > > > > > > > think that something like this would be crucial for
> > > > > > > > > an
> > > > > > > > > async
> > > > > > > > > response.
> > > > > > > > > So, if you implemented it you might be killing 2
> > > > > > > > > birds
> > > > > > > > > with
> > > > > > > > > 1
> > > > > > > > > stone.
> > > > > > > > >
> > > > > > > > > FYI: If you want to see examples of other systems
> > > > > > > > > that
> > > > > > > > > use
> > > > > > > > > similar
> > > > > > > > > concepts, take a look at the correlation ID in JMS.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Keith
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >>> _______________________________________________
> > > > > > > > > >>> vdsm-devel mailing list
> > > > > > > > > >>> vdsm-devel@lists.fedorahosted.org
> > > > > > > > > >>> https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > >>>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > Regards,
> > > > > > > 
> > > > > > > Dan Yasny
> > > > > > > Red Hat Israel
> > > > > > > +972 9769 2280
> > > > > > > 
> > > > > > 
> > > > > 
> > > > > --
> > > > > 
> > > > > 
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Dan Yasny
> > > > > Red Hat Israel
> > > > > +972 9769 2280
> > > > > 
> > > > 
> > > 
> > > --
> > > 
> > > 
> > > 
> > > Regards,
> > > 
> > > Dan Yasny
> > > Red Hat Israel
> > > +972 9769 2280
> > > 
> > 
> 
> --
> 
> 
> 
> Regards,
> 
> Dan Yasny
> Red Hat Israel
> +972 9769 2280
> 
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/vdsm-devel
