Re: [vdsm] flowID schema

Saggi Mizrahi Thu, 16 Feb 2012 09:04:10 -0800

Pinning down specific moments where stuff went horribly wrong is usually quite 
simple. The only reason I can't think of for someone to track a problem is if 
there was a bug in the storage subsystem that cause a corruption that only 
became apparent later on. But this can't benefit from flowID.


I'm going to to day this again. I want to hear someone to give a case where 
this is useful. Don't say things like.
"It will allow me to do X"
or
"I know of a guy who spent 2 years doing X to VDSM\Engine logs"

I want an example like:

The user complained about X. So I had to do Y to figure out what is wrong and 
it was a pain.

The Only reason I can think of is someone trying to figure out what wrong 
knowing nothing about how VDSM\Engine works.
While I just jump from place to place because I know what is going on, other 
people would just want to go step by step to get the complete picture.

But even with that I don't see why you would want to keep going from the Engine 
down to VDSM and back out again. 

Other then that I just simply can't imagine a use case where I'd need to do the 
stuff you talk about.
But again, I usually get my bug reports from QA and QE, and they are more 
skilled at bug reporting then the general public.

Further more, WHAT IS A FLOW? When does it start? When does it end? create 
Image is a flow? is the connect to the domain included? isn't it all just part 
of a big flow to create a VM? Does the engine even track it as a flow?

Flows depend on the debugger the problem and the scope. Throwing another 
useless ID in the pile will give you nothing. What you want is to be able to 
map sophisticated connections between resources and operations inside every 
component and between them. For instance, in VDSM. To track a resource locking 
issues you use the resourceID. It crosses "flows". To debug connection issues 
you use the connection information (and soon, the reference ID). An it crosses 
"flows" as well.

I'm not saying that debugging Ovirt is easy. I'm just saying that this is not 
the solution IMHO. Good anchors to resources like taskIDs, connection reference 
IDs, domain IDs and resource IDs give you the ability to track whatever you 
want. You just need better tools to cross reference them across log files so 
that the log tricks I *know* are implemented in a way that everyone can use 
them.

This is when good tools come in to play. Think about it like a DB. Should 
something be a table with an index or should it just be a view.

(more inline)

----- Original Message -----
> From: "Simon Grinberg" <si...@redhat.com>
> To: "Saggi Mizrahi" <smizr...@redhat.com>
> Cc: vdsm-devel@lists.fedorahosted.org, "Dan Yasny" <dya...@redhat.com>, "Ayal 
> Baron" <aba...@redhat.com>
> Sent: Thursday, February 16, 2012 11:07:34 AM
> Subject: Re: [vdsm] flowID schema
> 
> 
> 
> ----- Original Message -----
> > From: "Saggi Mizrahi" <smizr...@redhat.com>
> > To: "Simon Grinberg" <si...@redhat.com>
> > Cc: vdsm-devel@lists.fedorahosted.org, "Dan Yasny"
> > <dya...@redhat.com>, "Ayal Baron" <aba...@redhat.com>
> > Sent: Thursday, February 16, 2012 5:18:26 PM
> > Subject: Re: [vdsm] flowID schema
> > 
> > You could just cross check flowID (that is printed in RHEV-M to all
> > the call IDs in that flow, It's a simple enough tool to make).
> > 
> > Also if you look at the way VDSM is heading you will see that just
> > grepping for the flow ID will gradually give you a smaller and
> > smaller picture of the actual flow.
> > 
> > - The new connection management means that the actual connect is
> > not
> > a direct part of the flow anymore.
> > - We plan to have a similar mechanism for domains.
> > - The new image code will do a lot of things "out of band".
> > - There will be more flows using multiple hosts because of SDM and
> > the increased safety gained by sanlock and they will not share the
> > flow ID in internal communication.
> > - Engine actually polls all tasks in a single thread (with it's own
> > flow ID? No flowID?) so even the actual result for async tasks
> > might
> > have a different flowID in the VDSM log.
> 
> OK so they may be a terminology mismatch here:
> A flow as an end user would see it is: Everything that happened since
> the moment he clicked "OK" and to the moment the operation failed
> succeeded.
> Same goes for internal/auto triggered actions in response to an event
> (Storage failure etc).
> 
> It may be composed as you notes above of several connections,
> multiple tasks etc.
> The flow ID as I see it is the linkage that makes sense in all of
> these. Everything that you've wrote above just convinces me that
> such an ID is a must.
I understood that. The point is that the system is too complex to be able to 
find certain operations to flows.
If you connect in FlowX but use a resource provided by the connection in flowY. 
The connection breaks and you loose the resource.
As long as you somehow pinned the flowID value to the connection object it 
might be logged as flowX. The actual problem happened in flowY.
My point is that the logical leaps required to extract only relevant data to a 
flow are impossible at run time as VDSM is too low in the stack to know the 
context and figure out what's related and what isn't. For simple log syncing 
you could just force users to use NTP.
> 
> 
> > 
> > In order to keep flowID sane in that kind of environment, there
> > will
> > have to be a lot of code moving that value around just to log it.
> 
> Indeed, write well once and serve many later, that is what coding is
> all about.
> If you do it while you code you'll simplify the external tools
> required while making even manual debug tasks more reasonable. And
> more important - faster
> 
> It is called Design for Debug!!! And in hardware world, it consumes
> 20% more resources. It saves much more afterwards.
> 
> Most important, it will save escalations to developers, meaning that
> a developer that will invest in that actually does this for his own
> sanity (unless he does not care :)). The easier it is to debug the
> system by the end user first then by community as an escalation
> point and only then developers, the more time you'll have to do what
> you love to do best - develop.
> 
> Remember, developers are few, community is larger and users (if you
> are lucky) are in order of magnitudes larger. Every effort you spend
> in the design phase to reduce escalations will reward you later.
> 
> 
> 
> > 
> > What we really need is something like the systemd journal
> > (http://bit.ly/vbZEBQ).
> > The project is still in it's infancy but the actual ideas are
> > solid.
> > * Always log in UTC
> > * Log in a query-able format with typed key\value pairs instead of
> > just free text messages
> > * Use MSGID for *some* of the more important log lines for easy
> > matching (and use UUID for MSGID instead of a rolling number)
> 
> That is a good start - always good to have regardless of the above
> discussion.
> 
> > 
> > With this kind of format you can easily write tools that sync up
> > logs
> > and do any other post processing.
> > It will also allow us to easily migrate to the systemd journal when
> > it's available.
> > 
> > We could just log to an sqlite db but I have a felling the constant
> > fsyncing\flocking will be a performance issue.
> 
> Some will agree, some will disagree, not sure that having another DB
> in the system is advisable.
> 
> > 
> > So I suggest implementing this by having a log target which writes
> > json objects and have a tool to put it all in an sqlite DB when
> > necessary. This way you don't get a performance hit for doing
> > something drastically different then appending to text files but
> > you
> > can later easily "pack" the logs to a db file that can easily be
> > navigated, merged and processed.
> > 
> > You could get the "classic view" by just using the msg field from
> > the
> > json object for each line. You could do that without going through
> > the "packing" process so it's instantly available.
> > 
> > I've been pondering this for quite some time and this might be a
> > good
> > excuse to give it a shot. As long as all the components in OVIRT
> > write json objects we could merge all logs together to a single log
> > db.
> > You could then try and detect small time time drifts using
> > heuristics
> > and have a unified view of everything.
> > 
> > I can do a more in depth write up if people think it's a good
> > direction to go to.
> > 
> > ----- Original Message -----
> > > From: "Simon Grinberg" <si...@redhat.com>
> > > To: "Ayal Baron" <aba...@redhat.com>, "Saggi Mizrahi"
> > > <smizr...@redhat.com>
> > > Cc: vdsm-devel@lists.fedorahosted.org, "Dan Yasny"
> > > <dya...@redhat.com>
> > > Sent: Wednesday, February 15, 2012 5:50:34 AM
> > > Subject: Re: [vdsm] flowID schema
> > > 
> > > Hi Ayal,
> > > 
> > >  > > actual use case where someone would have actually
> > >  > > benefited
> > >  > > from
> > >  > > this.
> > > 
> > > 
> > > Sorry joining that late into the discussion.
> > > 
> > > Forget the automatic analyses/past issues recognition that may or
> > > may
> > > not be possible if we do have flow ID. (Personally I believe it's
> > > possible for known issues)
> > > 
> > > Think of large setups, not the few hosts setups that are running
> > > in
> > > the labs.
> > > Dozens of hosts, hundreds of VMs and tasks, running for few
> > > months.
> > > --->>> Huge logs (yes we even got 50G sized logs)
> > > 
> > > Now try to do what you and Saggy are suggesting said, yes it will
> > > work!!! Sometimes it is the only approach that will be possible.
> > > 
> > > But what about efficiency?
> > > 
> > > Some times I found myself spending 30 minutes just getting to the
> > > files that I need to view and to the approximate location of the
> > > error.
> > > If you have a flow ID you can have a tool that will extract all
> > > the
> > > relevant information from all the hosts participating in that
> > > flow
> > > and correlate that with the rhevm.log as the anchor.
> > > 
> > > (Valdik actually wrote a tool that follows the SPM and creates an
> > > SPM
> > > log no matter on which host it resides - this allowed tracking
> > > flows
> > > and proved itself useful and time saving, especially since when
> > > storage issues begin SPM tends to move around :))
> > > 
> > > In addition to that think about skill levels,
> > > For you it's very easy to go and debug, you wrote the code, you
> > > know
> > > what to expect and what is the normal behavior (most of the
> > > time).
> > > But what about others?
> > > When encountering an issue within a flow for a front line or a
> > > customer self debug the easiest way will be to actually compare
> > > it
> > > to the same flow that succeed (if exist) and understand what was
> > > supposed to happen next (or before) and did not. I actually used
> > > that technique a lot. Again a flow ID will save time. All you
> > > have
> > > to do is extract an identical flow that worked using the
> > > flow-extractor utility and then compare.
> > > 
> > > 
> > > I agree that when you get to the really tough cases, they will
> > > span
> > > across flows and your only alternative may be just to dig in.
> > > However for many others, flows tracking will save support calls,
> > > support escalations and will considerably help with education.
> > > For
> > > 2.2 we had Marina going into the code and documenting flows, so
> > > support people can learn what to expect and what to look for in
> > > the
> > > logs.
> > > 
> > > I believe it will also help engineering, lot's of times I had to
> > > gather 3 engineers and sat together just to understand a flow in
> > > order to debug an issue or understand what to expect if someone
> > > complained about a weird behavior. Since each of them only new
> > > what
> > > his part is doing - all had to go into their code and discuss.
> > > With
> > > flow ID you can just extract the flow and easily see what is
> > > being
> > > done (you may find out as happened before that it's not exactly
> > > what
> > > they meant to do)
> > > 
> > > In short, if we aim to look at RHEV as a system, flows are
> > > helpful
> > > in
> > > many aspects. Instead of having discrete components and requiring
> > > a
> > > high level of expertise and spending a lot of time (finding the
> > > right spot) to debug even minor issues you can look at a system
> > > and
> > > easily get (most of the time), understand what to expect and get
> > > to
> > > the most relevant logs fairly fast.
> > > 
> > > Simon.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Dan Yasny" <dya...@redhat.com>
> > > > To: "Ayal Baron" <aba...@redhat.com>
> > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > vdsm-devel@lists.fedorahosted.org
> > > > Sent: Friday, February 10, 2012 4:01:01 PM
> > > > Subject: Re: [vdsm] flowID schema
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Ayal Baron" <aba...@redhat.com>
> > > > > To: "Dan Yasny" <dya...@redhat.com>
> > > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > > vdsm-devel@lists.fedorahosted.org
> > > > > Sent: Friday, 10 February, 2012 3:51:01 PM
> > > > > Subject: Re: [vdsm] flowID schema
> > > > > 
> > > > > 
> > > > > 
> > > > > ----- Original Message -----
> > > > > > 
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > > From: "Ayal Baron" <aba...@redhat.com>
> > > > > > > To: "Dan Yasny" <dya...@redhat.com>
> > > > > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > Sent: Friday, 10 February, 2012 2:55:46 PM
> > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > ----- Original Message -----
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Ayal Baron" <aba...@redhat.com>
> > > > > > > > > To: "Dan Yasny" <dya...@redhat.com>
> > > > > > > > > Cc: "Simon Grinberg" <si...@redhat.com>,
> > > > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > > > Sent: Friday, 10 February, 2012 12:50:04 AM
> > > > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Saggi Mizrahi" <smizr...@redhat.com>
> > > > > > > > > > > To: "Keith Robertson" <krobe...@redhat.com>
> > > > > > > > > > > Cc: "VDSM Project Development"
> > > > > > > > > > > <vdsm-devel@lists.fedorahosted.org>
> > > > > > > > > > > Sent: Thursday, February 9, 2012 2:24:44 PM
> > > > > > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > > > > > >
> > > > > > > > > > > -1
> > > > > > > > > > >
> > > > > > > > > > > I agree that for messaging environment having a
> > > > > > > > > > > Message
> > > > > > > > > > > ID
> > > > > > > > > > > is
> > > > > > > > > > > a
> > > > > > > > > > > must
> > > > > > > > > > > because you sometimes don't have a particular
> > > > > > > > > > > target
> > > > > > > > > > > so
> > > > > > > > > > > when
> > > > > > > > > > > you
> > > > > > > > > > > get
> > > > > > > > > > > a response you need to know what this node is
> > > > > > > > > > > actually
> > > > > > > > > > > responding
> > > > > > > > > > > to.
> > > > > > > > > > >
> > > > > > > > > > > The message ID could be composed with
> > > > > > > > > > > <FLOWID><MSGID>
> > > > > > > > > > > so
> > > > > > > > > > > you
> > > > > > > > > > > can
> > > > > > > > > > > reuse the field.
> > > > > > > > > > >
> > > > > > > > > > > But that is all besides the point.
> > > > > > > > > > >
> > > > > > > > > > > I understand that someone might find it fun to go
> > > > > > > > > > > on
> > > > > > > > > > > following
> > > > > > > > > > > the
> > > > > > > > > > > entire flow in the Engine and in VDSM. But I
> > > > > > > > > > > would
> > > > > > > > > > > like
> > > > > > > > > > > to
> > > > > > > > > > > hear
> > > > > > > > > > > an
> > > > > > > > > > > actual use case where someone would have actually
> > > > > > > > > > > benefited
> > > > > > > > > > > from
> > > > > > > > > > > this.
> > > > > > > > > > > As I see it having VSDM return the task ID with
> > > > > > > > > > > every
> > > > > > > > > > > response
> > > > > > > > > > > (and
> > > > > > > > > > > not just for async tasks) is a lot more useful
> > > > > > > > > > > and
> > > > > > > > > > > correct.
> > > > > > > > > > 
> > > > > > > > > > Actually, the only way to understand what happened
> > > > > > > > > > in
> > > > > > > > > > a
> > > > > > > > > > certain
> > > > > > > > > > flow
> > > > > > > > > > is to follow it through. From the engine log where
> > > > > > > > > > an
> > > > > > > > > > action
> > > > > > > > > > was
> > > > > > > > > > initiated, down to the hosts that did the
> > > > > > > > > > execution.
> > > > > > > > > > Everything
> > > > > > > > > > RHEV
> > > > > > > > > > does is a flow, and with no correlation between
> > > > > > > > > > hosts
> > > > > > > > > > executing
> > > > > > > > > > parts of the same flow, troubleshooting turns into
> > > > > > > > > > guesswork,
> > > > > > > > > > because the only contact point left is time, which
> > > > > > > > > > is
> > > > > > > > > > useless
> > > > > > > > > > when
> > > > > > > > > > you're talking about vdsm - there are sometimes
> > > > > > > > > > hundreds
> > > > > > > > > > of
> > > > > > > > > > log
> > > > > > > > > > records in a single second, and not every host is
> > > > > > > > > > in
> > > > > > > > > > absolute
> > > > > > > > > > sync
> > > > > > > > > > with every other.
> > > > > > > > > 
> > > > > > > > > What are you talking about? you know exactly what
> > > > > > > > > operation
> > > > > > > > > the
> > > > > > > > > engine ran at vdsm level.
> > > > > > > > 
> > > > > > > > Not always true. Haven't had much chance to deeply dive
> > > > > > > > into
> > > > > > > > 3.0
> > > > > > > > logs, and if things changed there, it's already a huge
> > > > > > > > step
> > > > > > > > in
> > > > > > > > the
> > > > > > > > right direction.
> > > > > > > 
> > > > > > > engine logs now always log calls to vdsm.
> > > > > > 
> > > > > > finally! :)
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > If it's a task then you also have a task
> > > > > > > > > id which is a uuid so you don't need anything else.
> > > > > > > > 
> > > > > > > > Right, but 1. not everything is a task and
> > > > > > > 
> > > > > > > Perhaps that should be fixed...
> > > > > > > 
> > > > > > > > 2. tasks spawn other tasks, and need to be followed
> > > > > > > > through
> > > > > > > > for
> > > > > > > > that
> > > > > > > 
> > > > > > > In vdsm? no they don't
> > > > > > 
> > > > > > Err, ok, in the engine then. The point being, a flow like
> > > > > > MoveMultipleImages (pardon my bad memory for spelling)
> > > > > > spawns
> > > > > > several tasks, and even within those tasks there's usually
> > > > > > more
> > > > > > than
> > > > > > a single flow.
> > > > > > 
> > > > > > > 
> > > > > > > > 3. long running tasks are hell to debug, because they
> > > > > > > > span
> > > > > > > > several
> > > > > > > > log files and thousands lines of logs
> > > > > > > 
> > > > > > > flowid would not solve this in any way.
> > > > > > > All you need to do to see the entire task is grep the
> > > > > > > taskID
> > > > > > > as
> > > > > > > the
> > > > > > > thread name is the taskID
> > > > > > 
> > > > > > which causes me to
> > > > > > 1. jump between engine and vdsm logs, comparing task IDs
> > > > > > and
> > > > > > actions.
> > > > > > 2. do it many times in case of a complex flow
> > > > > > 
> > > > > > Seeing the entire thing on RHEV/oVirt as a system would
> > > > > > make
> > > > > > a
> > > > > > better
> > > > > > picture, and being able to grep the entire set of logs for
> > > > > > a
> > > > > > single
> > > > > > ID is a huge plus, since it would produce one large log
> > > > > > that
> > > > > > (hopefully, and if everything is done right) shows the full
> > > > > > flow,
> > > > > > step by step, everywhere it took place
> > > > > 
> > > > > Which is never what you really want as it is pretty useless.
> > > > >  What
> > > > > you want is the point of failure and a few hundred lines
> > > > > before.
> > > > 
> > > > The problem is - the point of failure is not always the point
> > > > with
> > > > the python traceback. This is why a flow should be follow-able
> > > > 
> > > > >  The rest is just distraction which is the point of this
> > > > >  thread.
> > > > >   I
> > > > > have never encountered a case where I needed such a flow to
> > > > > debug
> > > > > an
> > > > > issue (even when it spread across multiple hosts).
> > > > 
> > > > No, the rest is an indication of flow progress, which can help
> > > > with
> > > > troubleshooting of issues that don't end in an error. Like
> > > > performance and scalability issues
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > In addition, now that engine logs results, you can
> > > > > > > > > just
> > > > > > > > > grep
> > > > > > > > > that
> > > > > > > > > instead of a flow id and land at the exact correct
> > > > > > > > > command
> > > > > > > > > and
> > > > > > > > > not
> > > > > > > > > have to figure out which out of the 5 run in this
> > > > > > > > > flow
> > > > > > > > > is
> > > > > > > > > the
> > > > > > > > > relevant one.
> > > > > > > > 
> > > > > > > > Haven't seen that yet, but again, what are results?
> > > > > > > > When
> > > > > > > > the
> > > > > > > > failure
> > > > > > > > is somewhere in the middle of the flow, the resulting
> > > > > > > > failure
> > > > > > > > can
> > > > > > > > be
> > > > > > > > totally irrelevant.
> > > > > > > 
> > > > > > > When you have the failure message then all you have to do
> > > > > > > is
> > > > > > > grep
> > > > > > > it
> > > > > > > in the log and reach the exact call that failed.  Just
> > > > > > > search
> > > > > > > for
> > > > > > > "ERROR|FAILED" above that and you reach the place of
> > > > > > > error.
> > > > > > 
> > > > > > It's the typical approach to logs, that works for other
> > > > > > systems.
> > > > > > With
> > > > > > RHEV, just finding the traceback is never the solution,
> > > > > > because
> > > > > > it
> > > > > > shows where everything finally went offhill, instead of
> > > > > > where
> > > > > > the
> > > > > > error happened.
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > If you could give a real example where this would be
> > > > > > > > > beneficial
> > > > > > > > > (i.e.
> > > > > > > > > log excerpts, how you correlated them and how flow id
> > > > > > > > > would
> > > > > > > > > have
> > > > > > > > > eased your job) that would be great.
> > > > > > > > 
> > > > > > > > Don't have these handy, guess Vladik, who was
> > > > > > > > collecting
> > > > > > > > interesting
> > > > > > > > fail flows could have helped.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Note that I've also discussed this with Yaniv from qe
> > > > > > > > > who
> > > > > > > > > said
> > > > > > > > > they
> > > > > > > > > don't really need it.
> > > > > > > > 
> > > > > > > > I'm not saying I want to see a flow ID as such, what I
> > > > > > > > _am_
> > > > > > > > saying
> > > > > > > > is
> > > > > > > > that flows are important, and we need an easy way of
> > > > > > > > following
> > > > > > > > them
> > > > > > > > through.
> > > > > > > > When a user starts a process in the engine, it should
> > > > > > > > be
> > > > > > > > clearly
> > > > > > > > logged and marked, then it should consistently report
> > > > > > > > progress
> > > > > > > > and
> > > > > > > > interim outputs, say which host was picked for what
> > > > > > > > action,
> > > > > > > > and
> > > > > > > > how
> > > > > > > > that action can be identified in the vdsm logs.
> > > > > > > 
> > > > > > > That's fine (although periodic logging causes log
> > > > > > > overflow)
> > > > > > 
> > > > > > Can't those be aggregated somehow? Or sent into a different
> > > > > > log?
> > > > > > There was even an idea of maintaining a local db for
> > > > > > logging
> > > > > > instead
> > > > > > of plaintext, sqlite would serve for that purpose
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > We cannot rely on timing. We cannot rely on everyone
> > > > > > > > knowing
> > > > > > > > obscure
> > > > > > > > engine/vdsm action naming conventions, that are not
> > > > > > > > exactly
> > > > > > > > the
> > > > > > > > same
> > > > > > > > as they look in the GUI. A person with no understanding
> > > > > > > > of
> > > > > > > > engine
> > > > > > > > and vdsm internals, should be able to easily follow an
> > > > > > > > action
> > > > > > > > through to conclusion, and understand what was done at
> > > > > > > > each
> > > > > > > > step,
> > > > > > > > what the system got in return and how it reacted. And
> > > > > > > > all
> > > > > > > > of
> > > > > > > > these
> > > > > > > > actions should be clearly inter-related, so a single
> > > > > > > > grep
> > > > > > > > can
> > > > > > > > select
> > > > > > > > a flow.
> > > > > > > > 
> > > > > > > > If you prefer to go in the other direction, like
> > > > > > > > turning
> > > > > > > > everything
> > > > > > > > into a task, that can also work, but each task, when
> > > > > > > > spawning
> > > > > > > > another, should very clearly show the relation as well,
> > > > > > > > and
> > > > > > > > sets
> > > > > > > > of
> > > > > > > > tasks should be marked as sets.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > >
> > > > > > > > > > > A generic debugging scenario as I see it.
> > > > > > > > > > >
> > > > > > > > > > > 1. Something went wrong
> > > > > > > > > > > 2. You go looking in the ENGINE log trying to
> > > > > > > > > > > figure
> > > > > > > > > > > out
> > > > > > > > > > > what
> > > > > > > > > > > happend.
> > > > > > > > > > > 3. You see that ENGINE got SomeError.
> > > > > > > > > > 
> > > > > > > > > > ok, the rest are all downhill.
> > > > > > > > > > 
> > > > > > > > > > 4. You follow the failure back to the start of the
> > > > > > > > > > flow,
> > > > > > > > > > then
> > > > > > > > > > go
> > > > > > > > > > with
> > > > > > > > > > the flow to the point where the engine exited to
> > > > > > > > > > vdsm
> > > > > > > > > > 5. switch over to vdsm logs, make sure you have the
> > > > > > > > > > timing
> > > > > > > > > > right
> > > > > > > > > > (with no flow ID that's the olny orientation after
> > > > > > > > > > all)
> > > > > > > > > > 6. find the start of the vdsm-side flow, follow it
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > failure,
> > > > > > > > > > pray the error makes sense.
> > > > > > > > > > 
> > > > > > > > > > In many cases the answer is not in the vdsm failure
> > > > > > > > > > traceback
> > > > > > > > > > but
> > > > > > > > > > somewhere in the middle of the flow, with no errors
> > > > > > > > > > reported,
> > > > > > > > > > this
> > > > > > > > > > is why we need a way to easily follow things
> > > > > > > > > > through.
> > > > > > > > > > Moreover,
> > > > > > > > > > the
> > > > > > > > > > logs should be readable enough to make sense to a
> > > > > > > > > > typical
> > > > > > > > > > sysadmin,
> > > > > > > > > > and not a RHEV expert.
> > > > > > > > > > 
> > > > > > > > > > > 4. Check to see if this error makes sense
> > > > > > > > > > > imagining
> > > > > > > > > > > that
> > > > > > > > > > > VDSM
> > > > > > > > > > > is
> > > > > > > > > > > always right and is a black box.
> > > > > > > > > > > 5. You did your digging and now you think that
> > > > > > > > > > > VDSM
> > > > > > > > > > > is
> > > > > > > > > > > as
> > > > > > > > > > > fault.
> > > > > > > > > > > 6. Go look for the call that failed. (If we
> > > > > > > > > > > returned
> > > > > > > > > > > the
> > > > > > > > > > > taskID
> > > > > > > > > > > it's
> > > > > > > > > > > pretty simple to find that call).
> > > > > > > > > > > 7. Look around the call to check VDSM state.
> > > > > > > > > > > 8. Profit.
> > > > > > > > > > >
> > > > > > > > > > > There is never a point where you want to follow a
> > > > > > > > > > > whole
> > > > > > > > > > > flow
> > > > > > > > > > > call
> > > > > > > > > > > by
> > > > > > > > > > > call going back and forth, and even if you did
> > > > > > > > > > > having
> > > > > > > > > > > the
> > > > > > > > > > > VDSM
> > > > > > > > > > > taskID is a better anchor then flowID.
> > > > > > > > > > 
> > > > > > > > > > not everything is a task, flow IDs would unify
> > > > > > > > > > entire
> > > > > > > > > > flows,
> > > > > > > > > > and
> > > > > > > > > > make
> > > > > > > > > > following them easy.
> > > > > > > > > > 
> > > > > > > > > > >
> > > > > > > > > > > VDSM is built in a way that every call takes in
> > > > > > > > > > > to
> > > > > > > > > > > account
> > > > > > > > > > > the
> > > > > > > > > > > current state only. Debugging it with an engine
> > > > > > > > > > > flow
> > > > > > > > > > > mindset
> > > > > > > > > > > is
> > > > > > > > > > > just
> > > > > > > > > > > wrong and distracting. I see it doing more harm
> > > > > > > > > > > the
> > > > > > > > > > > good
> > > > > > > > > > > by
> > > > > > > > > > > reinforcing bad debugging practices.
> > > > > > > > > > 
> > > > > > > > > > Maybe you're right, though I can't see how from my
> > > > > > > > > > experience
> > > > > > > > > > so
> > > > > > > > > > far,
> > > > > > > > > > but following the flows is the only thing that got
> > > > > > > > > > cases
> > > > > > > > > > resolved.
> > > > > > > > > > Not event IDs making every possible error, and not
> > > > > > > > > > task
> > > > > > > > > > IDs
> > > > > > > > > > (though
> > > > > > > > > > these do have their uses) - slow and meticulous
> > > > > > > > > > mapping
> > > > > > > > > > of
> > > > > > > > > > flows
> > > > > > > > > > to
> > > > > > > > > > log records.
> > > > > > > > > > 
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "Keith Robertson" <krobe...@redhat.com>
> > > > > > > > > > > > To: "VDSM Project Development"
> > > > > > > > > > > > <vdsm-devel@lists.fedorahosted.org>
> > > > > > > > > > > > Sent: Thursday, February 9, 2012 1:34:43 PM
> > > > > > > > > > > > Subject: Re: [vdsm] flowID schema
> > > > > > > > > > > >
> > > > > > > > > > > > On 02/09/2012 12:18 PM, Andrew Cathrow wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > >> From: "Ayal Baron"<aba...@redhat.com>
> > > > > > > > > > > > >> To: "Dan Kenigsberg"<dan...@redhat.com>
> > > > > > > > > > > > >> Cc: "VDSM Project
> > > > > > > > > > > > >> Development"<vdsm-devel@lists.fedorahosted.org>
> > > > > > > > > > > > >> Sent: Monday, February 6, 2012 10:35:54 AM
> > > > > > > > > > > > >> Subject: Re: [vdsm] flowID schema
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> ----- Original Message -----
> > > > > > > > > > > > >>> On Thu, Feb 02, 2012 at 10:32:49AM -0500,
> > > > > > > > > > > > >>> Saggi
> > > > > > > > > > > > >>> Mizrahi
> > > > > > > > > > > > >>> wrote:
> > > > > > > > > > > > >>>> flowID makes no sense after the initial
> > > > > > > > > > > > >>>> API
> > > > > > > > > > > > >>>> call
> > > > > > > > > > > > >>>> as
> > > > > > > > > > > > >>>> stuff
> > > > > > > > > > > > >>>> like
> > > > > > > > > > > > >>>> cacheing\threadpools\samplingtasks\resources\asyncTasks
> > > > > > > > > > > > >>>> so
> > > > > > > > > > > > >>>> flowing
> > > > > > > > > > > > >>>> a flow like that will not give you the
> > > > > > > > > > > > >>>> entire
> > > > > > > > > > > > >>>> picture
> > > > > > > > > > > > >>>> while
> > > > > > > > > > > > >>>> debugging.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> Also adding it now will make everything
> > > > > > > > > > > > >>>> even
> > > > > > > > > > > > >>>> more
> > > > > > > > > > > > >>>> ugly.
> > > > > > > > > > > > >>>> You know what, just imagine I wrote one of
> > > > > > > > > > > > >>>> my
> > > > > > > > > > > > >>>> long
> > > > > > > > > > > > >>>> rambles
> > > > > > > > > > > > >>>> about
> > > > > > > > > > > > >>>> why I don't agree with doing this.
> > > > > > > > > > > > >>> I cannot imagine you write anything like
> > > > > > > > > > > > >>> that.
> > > > > > > > > > > > >>> Really.
> > > > > > > > > > > > >>> I
> > > > > > > > > > > > >>> do
> > > > > > > > > > > > >>> not
> > > > > > > > > > > > >>> understand why you object logging flowID on
> > > > > > > > > > > > >>> API
> > > > > > > > > > > > >>> entry
> > > > > > > > > > > > >>> point.
> > > > > > > > > > > > >> The question is, what problem is this really
> > > > > > > > > > > > >> trying
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> solve
> > > > > > > > > > > > >> and
> > > > > > > > > > > > >> is
> > > > > > > > > > > > >> there a simpler and less obtrusive solution
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> that
> > > > > > > > > > > > >> problem?
> > > > > > > > > > > > > correlating logs between ovirt engine and
> > > > > > > > > > > > > potentially
> > > > > > > > > > > > > multiple
> > > > > > > > > > > > > vdsm
> > > > > > > > > > > > > nodes is a nightmare. It requires a lot skill
> > > > > > > > > > > > > to
> > > > > > > > > > > > > follow
> > > > > > > > > > > > > a
> > > > > > > > > > > > > transaction through from the front end all
> > > > > > > > > > > > > the
> > > > > > > > > > > > > way
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > node,
> > > > > > > > > > > > > and even multiple nodes (eg actions on spm,
> > > > > > > > > > > > > then
> > > > > > > > > > > > > actions
> > > > > > > > > > > > > on
> > > > > > > > > > > > > other
> > > > > > > > > > > > > node to run a vm).
> > > > > > > > > > > > > Having a way to correlate the logs and follow
> > > > > > > > > > > > > a
> > > > > > > > > > > > > single
> > > > > > > > > > > > > event/flow
> > > > > > > > > > > > > is vital.
> > > > > > > > > > > > >
> > > > > > > > > > > > +1
> > > > > > > > > > > >
> > > > > > > > > > > > Knowing what command caused a sequence of
> > > > > > > > > > > > events
> > > > > > > > > > > > in
> > > > > > > > > > > > VDSM
> > > > > > > > > > > > would
> > > > > > > > > > > > be
> > > > > > > > > > > > really
> > > > > > > > > > > > helpful particularly in a threaded environment.
> > > > > > > > > > > >  Further,
> > > > > > > > > > > > wouldn't
> > > > > > > > > > > > such
> > > > > > > > > > > > an ID be helpful in an asynchronous
> > > > > > > > > > > > request/response
> > > > > > > > > > > > model?
> > > > > > > > > > > >  I'm
> > > > > > > > > > > > not
> > > > > > > > > > > > sure what the plans are for AMQP or even if
> > > > > > > > > > > > there
> > > > > > > > > > > > are
> > > > > > > > > > > > plans,
> > > > > > > > > > > > but
> > > > > > > > > > > > I'd
> > > > > > > > > > > > think that something like this would be crucial
> > > > > > > > > > > > for
> > > > > > > > > > > > an
> > > > > > > > > > > > async
> > > > > > > > > > > > response.
> > > > > > > > > > > > So, if you implemented it you might be killing
> > > > > > > > > > > > 2
> > > > > > > > > > > > birds
> > > > > > > > > > > > with
> > > > > > > > > > > > 1
> > > > > > > > > > > > stone.
> > > > > > > > > > > >
> > > > > > > > > > > > FYI: If you want to see examples of other
> > > > > > > > > > > > systems
> > > > > > > > > > > > that
> > > > > > > > > > > > use
> > > > > > > > > > > > similar
> > > > > > > > > > > > concepts, take a look at the correlation ID in
> > > > > > > > > > > > JMS.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Keith
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >>> _______________________________________________
> > > > > > > > > > > > >>> vdsm-devel mailing list
> > > > > > > > > > > > >>> vdsm-devel@lists.fedorahosted.org
> > > > > > > > > > > > >>> https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >> _______________________________________________
> > > > > > > > > > > > >> vdsm-devel mailing list
> > > > > > > > > > > > >> vdsm-devel@lists.fedorahosted.org
> > > > > > > > > > > > >> https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > > > > >>
> > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > vdsm-devel mailing list
> > > > > > > > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > > > >
> > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > vdsm-devel mailing list
> > > > > > > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > > > >
> > > > > > > > > > > _______________________________________________
> > > > > > > > > > > vdsm-devel mailing list
> > > > > > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > > >
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Regards,
> > > > > > > > > > 
> > > > > > > > > > Dan Yasny
> > > > > > > > > > Red Hat Israel
> > > > > > > > > > +972 9769 2280
> > > > > > > > > > _______________________________________________
> > > > > > > > > > vdsm-devel mailing list
> > > > > > > > > > vdsm-devel@lists.fedorahosted.org
> > > > > > > > > > https://fedorahosted.org/mailman/listinfo/vdsm-devel
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > 
> > > > > > > > Dan Yasny
> > > > > > > > Red Hat Israel
> > > > > > > > +972 9769 2280
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > --
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Regards,
> > > > > > 
> > > > > > Dan Yasny
> > > > > > Red Hat Israel
> > > > > > +972 9769 2280
> > > > > > 
> > > > > 
> > > > 
> > > > --
> > > > 
> > > > 
> > > > 
> > > > Regards,
> > > > 
> > > > Dan Yasny
> > > > Red Hat Israel
> > > > +972 9769 2280
> > > > 
> > > 
> > 
> 
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/vdsm-devel

Re: [vdsm] flowID schema

Reply via email to