https://bugzilla.wikimedia.org/show_bug.cgi?id=60555
--- Comment #8 from Ori Livneh <o...@wikimedia.org> ---
Copying my reply to Nuria here, so that the notes are public.

> - where are validation traces failures

Validation failures raise exceptions, which are written to standard error. Stderr is captured by Upstart, so the failures are logged at /var/log/upstart/eventlogging-processor.log.

> - why some events are affected but not others (I would expect all server side
> events to be affected equally)

Me too. But I don't remember whether we pushed the change to both production branches at the same time. So it may be that one branch was generating events that passed validation while the other was not.

> - whether changing validation not to fail upon adding additional fields seems
> a good idea. It beats ALTER TABLE `...` ADD COLUMN `random_field` ...

The events need to adhere to the table definition. But it goes beyond that. This isn't simply a trade-off that comes with using a relational database as one of several storage backends; it's one of the core design decisions of the entire platform. The setup EventLogging replaced allowed free-form output. As a result, the datasets used every ad-hoc format you can think of, the data was usually logged incorrectly, and no one knew how to analyze it a week after it was collected, because the "wisdom" of how to interpret it was lost.

If you program something that handles EventLogging data, you will see the simplicity that comes with knowing definitively that your input matches your expectations. You don't need to handle edge case after edge case. Events that fail validation are still recorded in the raw logs, so they are not lost; they merely don't get first-class treatment.

I can see why bug 60550 might lead you to question this design.
But let me tell you: this is probably the worst screw-up in the >1 year that this has been operational; the previous norm was for the system to run for months at a time without any intervention from me or anyone else.

The mistake in this case was mine (I hacked the code to pop the 'userAgent' field locally, but then clobbered that hack by deploying an unpatched version on top of it). In hindsight, I should have insisted on data sterility for the entirety of the process. When the validation failure has to do with the event object and not the capsule, the developer has the responsibility (and the means) for finding out, but this isn't the case with EventCapsule schema changes.

But I really didn't expect this to take so long -- I thought we'd be done with the user-agent change within a day of deploying the patch at the most. If it doesn't happen now, it may be best to revert the change, drop the userAgent field from the capsule schema, and remove the column from the database -- it's irresponsible of us to keep the system in this halfway state, and not surprising that it leads to mistakes and confusion.
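
For readers who want to see the strict-validation behaviour discussed in this comment, here is a minimal Python sketch. It is not EventLogging's actual code: the field list in `CAPSULE_FIELDS` and the `validate` and `process` helpers are hypothetical names invented for illustration. It only assumes what the comment states: events with unexpected fields fail validation, failures are reported on standard error, and every event is still kept in the raw log.

```python
import json
import sys

# Hypothetical, simplified stand-in for a capsule schema: field name -> type.
CAPSULE_FIELDS = {"schema": str, "revision": int, "wiki": str, "event": dict}


class ValidationError(Exception):
    """Raised when an event does not match the expected fields exactly."""


def validate(event):
    """Reject events with missing, mistyped, or unexpected fields."""
    if not isinstance(event, dict):
        raise ValidationError("event is not an object")
    extra = sorted(set(event) - set(CAPSULE_FIELDS))
    if extra:
        raise ValidationError("unexpected fields: %s" % ", ".join(extra))
    for name, expected in CAPSULE_FIELDS.items():
        if name not in event:
            raise ValidationError("missing field: %s" % name)
        if not isinstance(event[name], expected):
            raise ValidationError("wrong type for field: %s" % name)


def process(raw_lines, raw_log, err=sys.stderr):
    """Return the events that validate; keep every line in raw_log."""
    valid = []
    for line in raw_lines:
        raw_log.append(line)  # the raw log keeps everything, valid or not
        try:
            event = json.loads(line)
            validate(event)
        except (ValueError, ValidationError) as exc:
            # Failures are written to stderr, where a supervisor can capture them.
            print("validation failed: %s" % exc, file=err)
            continue
        valid.append(event)
    return valid
```

In this sketch, an event carrying an extra 'userAgent' key would be rejected and reported on stderr while still appearing in `raw_log`, which is roughly the situation described above for capsule schema changes.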