> On Tue, Oct 11, 2016 at 1:14 PM, Kamil Paral < kpa...@redhat.com > wrote:
> > > Proposal looks good to me, I don't have any strong objections.
> >
>
> > > 1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
> >
>
> > > 2. I think that having enum values in details in crash structure would be
> > > better, but I don't have strong opinion either way.
> >
>
> > For consistency checking, yes. But it's somewhat inflexible. If the need
> > arises, I imagine the detail string can be in json format (or
> > semicolon-separated keyvals or something) and we can store several useful
> > properties in there, not just one.
>
> I'd rather do the key-value thing as we do in ResultsDB than storing plalin
> Json. Yes the new Postgres can do it (and can also search it to some
> extent), but it is not all-mighty, and has its own problems.
Sure, that would be even better (but you didn't sound like supporting that idea
a few days back). So having "state", "blame" and "keyvals" would work well, in
my opinion.
> > E.g. not only that Koji call failed, but what was its HTTP error code. Or
> > not
> > that dnf install failed, but also whether it was the infamous "no more
> > mirror to try" error or a dependency error. I don't want to misuse that to
> > store loads of data, but this could be useful to track specific issues we
> > have hard times to track currently (e.g. our still existing depcheck issue,
> > that happens only rarely and it's difficult for us to get a list of tasks
> > affected by it). With this, we could add a flag "this is related to problem
> > XYZ that we're trying to solve".
>
> I probably understand, what you want, but I'd rather have a specified set of
> values, which will/can be acted upon. Maybe changing the structure to
> `{state, blame, cause, details}`, where the `cause` is still an enum of
> known values but details is freeform, but strictly used for humans?
Hm, I can't actually imagine any case where freeform text *strictly* for humans
would get too useful. But I can imagine cases where the field is useful for
humans *and* machines. As an example, we currently have several known causes
why minions are not properly created. Linking to backingstores might fail, the
libvirt VM creation or startup might fail, and we might time out while waiting
for the machine. Those are 3 different reasons for the same overall problem in
"MINION_CREATION". So, I imagine we can store it like this:
state: CRASHED
blame: TASKOTRON
keyvals:
step: MINION_CREATION
reason: TIMED_OUT
or like this:
state: CRASHED
blame: TASKOTRON
cause: MINION_CREATION
keyvals:
reason: TIMED_OUT
All of that is human readable and machine parseable. The top level keys are
enums in the database, the keyvals are defined in libtaskotron and the database
doesn't care. Of course, keyvals are keyvals, so we can easily add something
like "human_description: foo" if we need it and use it for some use cases, or
even "time_elapsed: 70" or whatever. It's keyvals, so we can easily ignore the
ones we don't care about, and use them only in specific cases, or whatever.
They are flexible and can be easily extended.
And now we finally get to the point. When Tim asks "do you know how often
backing stores linking fails during minion creation?" (as he really did in
T833), we can say "here's a simple query to find out!". Or, if it is not
implemented yet, we can say "let's add new 'reason: BACKINGSTORE_LINKING' and
let's see in a month!". Because currently we can't really say anything and the
best thing to do is to grep all logs, or make some custom hack on our servers.
> So we can "CRASHED->THIRDPARTY->UNKNOWN->"text of the exception" for example,
I understand it's just an example, but in this case it would be probably
CRASHED->UNKNOWN->UNKNOWN, because if you don't know to cause, you can't blame
anybody.
> or "CRASHED->TASKOTRON->NETWORK->"dnf - no more mirrors to try".
And here the blame would be THIRDPARTY, since it's NETWORK :-)
I'd prefer keyvals over this, because you can't fit anything useful for
querying to the free-form details field (or, as you said, you'd rather avoid
that). The fact that I know it crashed because of network might not be enough
information to debug issues, decide rescheduling, etc.
> I'd rather act on a known set of values, then have code like:
You probably meant 'than', right? I'm a bit confused here.
> if ('dnf' in detail and 'no more mirrors' in detail) or ('DNF' in detail and
> 'could not connect' in detail)
Do the example with keyvals I posted above look better? I'd say using the
keyvals with known enum values (defined in libtaskotron) is definitely more
realiable and maintainable than matching free-form text like above.
> in the end, it is almost the same, because there will be problems with
> clasifying the errors, and the more layers we add, the harder it gets - that
> is the reason I initially only wanted to do the {state, blame} thing.