[Bug 60555] Echo logs broken since Jan 16, 2014

bugzilla-daemon Fri, 31 Jan 2014 02:57:44 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=60555


--- Comment #8 from Ori Livneh <o...@wikimedia.org> ---
Copying my reply to Nuria here, so that the notes are public.

> - where are validation traces failures

Validation failures raise exceptions, which are written to standard error.
Stderr is captured by Upstart, so the failures end up in
/var/log/upstart/eventlogging-processor.log.
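
For illustration, the failure path looks roughly like the sketch below. This
is not the EventLogging source: `ValidationError`, `validate`, `process` and
the list of required fields are hypothetical names. All it is meant to show is
that a rejected event surfaces as a traceback on stderr, which is the stream
Upstart redirects to the job log.

```python
import sys
import traceback


class ValidationError(Exception):
    """Raised when an event does not match its schema (hypothetical)."""


def validate(event, required_fields):
    # Hypothetical check: every required field must be present.
    missing = [f for f in required_fields if f not in event]
    if missing:
        raise ValidationError("missing fields: %s" % ", ".join(missing))


def process(event, required_fields, stderr=sys.stderr):
    """Return True if the event validated; otherwise log the traceback."""
    try:
        validate(event, required_fields)
    except ValidationError:
        # Anything written to stderr ends up in the Upstart job log.
        traceback.print_exc(file=stderr)
        return False
    return True
```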

> - why some events are affected but not others (I would expect all server side 
> events to be affected equally)

Me too. But I don't remember if we pushed the change to both production
branches at the same time. So it may be that one branch was generating events
that were passing validation, and the other, not.

> - whether changing validation not to fail upon adding additional fields seems 
> a good idea.

It beats ALTER TABLE `...` ADD COLUMN `random_field` ...

The events need to adhere to the table definition. But it goes beyond that.
This isn't simply a trade-off that comes with using a relational database as
one of several storage backends; it's one of the core design decisions of the
entire platform. The setup EventLogging replaced allowed free-form output, so
the datasets used every ad-hoc format you can think of. The data was usually
logged incorrectly, and a week after it was collected no one knew how to
analyze it, because the "wisdom" of how to interpret it was lost.
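
To make the constraint concrete, here is a minimal sketch of strict
validation, assuming a schema is just a mapping of field name to expected
type. The `SCHEMA` contents and the `check_event` helper are made up for
illustration and are not the actual EventLogging validator. The point is that
an event carrying a field with no corresponding column is rejected rather than
silently stored.

```python
# Hypothetical schema: one entry per column in the backing table.
SCHEMA = {
    "wiki": str,
    "timestamp": int,
    "event": dict,
}


def check_event(event, schema=SCHEMA):
    """Return a list of problems; an empty list means the event is valid."""
    problems = []
    # Fields the table has no column for are rejected outright.
    for name in sorted(set(event) - set(schema)):
        problems.append("unexpected field: %s" % name)
    # Every declared field must be present and of the declared type.
    for name, expected in sorted(schema.items()):
        if name not in event:
            problems.append("missing field: %s" % name)
        elif not isinstance(event[name], expected):
            problems.append("wrong type for field: %s" % name)
    return problems
```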

If you program something that handles EventLogging data, you will see the
simplicity that comes with knowing definitively that your input matches your
expectations. You don't need to handle edge case after edge case.

Events that fail validation are still recorded in the raw logs, so they are not
lost; they merely don't get first-class treatment.
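
Roughly, the flow is the one sketched below: the raw line is kept
unconditionally, and only events that validate are passed on to the structured
store. `handle_line`, `raw_log`, `store` and `is_valid` are placeholders
assumed for the sketch, not the real components.

```python
import json


def handle_line(line, raw_log, store, is_valid):
    """Record the raw line, then store the parsed event only if it is valid."""
    raw_log.append(line)  # always kept, valid or not
    try:
        event = json.loads(line)
    except ValueError:
        return False  # unparseable: it exists only in the raw log
    if not is_valid(event):
        return False  # failed validation: likewise only in the raw log
    store.append(event)
    return True
```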

I can see why bug 60550 might lead you to question this design. But let me tell
you: this is probably the worst screw-up in the >1 year that this has been
operational; the previous norm was to have the system running for months at a
time without any intervention from me or anyone else. The mistake in this case
was mine (I hacked the code to pop the 'userAgent' field locally, but then
clobbered it by deploying an unpatched version on top of it).
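
A workaround of that kind could be as small as the sketch below. This is an
illustration, not the actual patch; `strip_user_agent` is a hypothetical name,
and the assumption is that the capsule is handled as a plain dict.

```python
def strip_user_agent(capsule):
    """Return a copy of the capsule without 'userAgent' (hypothetical helper)."""
    cleaned = dict(capsule)
    # pop() with a default is a no-op when the field is absent.
    cleaned.pop("userAgent", None)
    return cleaned
```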

In hindsight, I should have insisted on data sterility for the entirety of the
process. When the validation failure has to do with the event object and not
the capsule, the developer has the responsibility (and the means) for finding
out, but this isn't the case with EventCapsule schema changes. But I really
didn't expect this to take so long -- I thought we'd be done with the
user-agent change within a day of deploying the patch at the most. If it
doesn't happen now, it may be best to revert the change, drop the userAgent
field from the capsule schema, and remove the column from the database. It's
irresponsible of us to keep the system in this halfway state, and it's no
surprise that it leads to mistakes and confusion.

