Summary

This proposal is aimed at solving two intertwined problems in Zeek's log-writing system:

Problem: Batch writing code duplication

- Some log writers need to send multiple log records at a time in "batches". These include writers that send data to Elasticsearch, Splunk HEC, Kinesis, and various HTTP-based destinations.
- Right now, each of these log writers has similar or identical code to create and manage batches.
- This code duplication makes writing and maintaining "batching" log writers harder and more bug-prone.

Proposed Solution: Add a new optional API for writing a batch all at once, while still supporting older log writers that don't need to write batches.

Problem: Insufficient information about failures

- Different log writers can fail in a variety of ways.
- Some of these failure modes are amenable to automatic recovery within Zeek, and others could be corrected by an administrator if they knew about them.
- However, the current system for writing log records returns a boolean indicating only two log writer statuses: "true" means "Everything's fine!", and "false" means "Emergency!!! The log writer needs to be shut down!"

Proposed Solution:

a. For non-batching log writers, change the "false" status to just mean "There was an error writing a log record". The log writing system will then report those failures to other Zeek components such as plug-ins, so they can monitor a log writer's health, and make more sophisticated decisions about whether a log writer can continue running or needs to be shut down.

b. Batching log writers will have a new API anyway, so that will let log writers report more detail about write failures, including suggestions about possible ways to recover.

--------------------------------------------------------------------------------

Design Details

Current Implementation

At present, log writers are C++ classes which descend from the WriterBackend pure-virtual superclass. Each log writer must override several pure virtual member functions, which include:

* DoInit: Writer-specific initialization method.
* DoWrite: Write one log record. Returns a boolean, where true means "everything's fine", and false means "things are so bad, the log writer needs to be shut down."

Log writers can also optionally override this virtual member function:

* DoWriteLogs: Possibly writer-specific output method that records zero or more log entries. The default implementation in the superclass simply calls DoWrite() in a loop.

New Implementation

This has two main goals:

* Provide a new base class for log writers that supports writing a batch of records at once, handles all the batch creation and write logic, and offers more sophisticated per-record reporting on failures.
* Provide backward compatibility so "legacy" (existing, non-batching) log writers can build and run without code changes, while changing the meaning of "false" when returned from DoWrite() to "sending this one log record failed."

These goals will be achieved using three writer backend classes:

1. BaseWriterBackend

This will be a virtual base class, and is a superclass for both legacy and batching log writers.

- It will have the same API signature as the existing WriterBackend, except it will omit DoWrite().
- It will also expose the existing DoWriteLogs() member function as a pure virtual function, so there's a standard interface for WriterBackend::Write() to call.

2. WriterBackend

This class will derive from BaseWriterBackend, and will support legacy log writers as a drop-in replacement for the existing WriterBackend class.

- It will add a pure virtual DoWrite member function to BaseWriterBackend, so its API signature will be identical to the existing WriterBackend class. That will let legacy log writers inherit from it with no code changes, and also support new log writers that don't need batching.
- The return semantics for DoWrite will change so when it returns false, that will simply mean the argument record wasn't successfully written.
- Its specialization of DoWriteLogs will be nearly identical to Zeek's current implementation, except that when DoWrite returns false, DoWriteLogs will simply report the failure to the rest of Zeek, rather than triggering a log writer shutdown. Then, other Zeek components can monitor the writer's health and decide whether to shut down the log writer or let it continue.

3. BatchWriterBackend

This class will derive from BaseWriterBackend, and will write logs in batches.

- Instead of DoWrite, it will expose a DoWriteBatch pure virtual member function to accept logs in batches.
- Its specialization of DoWriteLogs will call DoWriteBatch.
- It will support configuring per-log-writer criteria that trigger flushing a batch, including:
  * Maximum age of the oldest cached log (default value TBD)
  * Maximum number of cached log records (default value TBD)
- DoWriteBatch will support rejecting logs at arbitrary indices in the batch, and will report details on which logs were rejected and why.

This is the proposed signature for DoWriteBatch:

    int BatchWriterBackend::DoWriteBatch(
        int num_writes,
        threading::Value*** vals,
        BatchWriterBackend::status_vector& failures
    );

where:

    num_writes = the number of log records in the batch
    vals       = the values of the log records to be written
    failures   = information about failed record writes

The return value is the number of log records actually written.

Compared to DoWriteLogs, DoWriteBatch omits the num_fields and fields arguments. Those aren't needed because the log writer already has those values, which were stored when they were supplied to its Init member function.

The failures argument is a reference to a std::vector of structs the log writer can fill in with details on failures to write individual records.

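
As one way to picture the two flush criteria listed above for BatchWriterBackend, here is a minimal C++ sketch. It is not proposed code: the BatchBuffer class, its member names, and the use of std::string as a stand-in for a log record are all hypothetical. Only the two triggers (maximum number of cached records, maximum age of the oldest cached record) come from the proposal. The current time is passed in explicitly so the logic stays deterministic and easy to test.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only; not part of Zeek or of the
// API proposed in this message.
class BatchBuffer {
public:
    using Clock = std::chrono::steady_clock;

    BatchBuffer(std::size_t max_records, Clock::duration max_age)
        : max_records_(max_records), max_age_(max_age) {}

    // Cache one record. The first record added to an empty buffer sets
    // the timestamp used for the age check.
    void Add(std::string record, Clock::time_point now) {
        if (records_.empty())
            oldest_ = now;
        records_.push_back(std::move(record));
    }

    // True if either flush criterion is met: the buffer holds at least
    // max_records, or the oldest cached record is at least max_age old.
    bool ShouldFlush(Clock::time_point now) const {
        if (records_.empty())
            return false;
        return records_.size() >= max_records_ || (now - oldest_) >= max_age_;
    }

    // Hand the cached records to the caller and leave the buffer empty.
    std::vector<std::string> Take() {
        std::vector<std::string> out;
        out.swap(records_);
        return out;
    }

    std::size_t Size() const { return records_.size(); }

private:
    std::size_t max_records_;
    Clock::duration max_age_;
    Clock::time_point oldest_{};
    std::vector<std::string> records_;
};
```

Under these assumptions, a batching writer would check ShouldFlush() after each Add() and again on a periodic timer, then pass the result of Take() to its batch-send routine.
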
The individual status structs will generally look like this:

    struct status {
        int m_failed_record_index;
        uint32_t m_failure_reason;
        uint32_t m_recovery_suggestion;
    };

where:

    m_failure_reason      indicates the general reason for the failure
    m_recovery_suggestion might contain a suggestion about handling the failure

If DoWriteBatch() returns a number that's smaller than num_writes, and the failures vector is empty, the caller will assume all the failed records were at the end of the batch, and try to re-transmit them in a later batch.

--------------------------------------------------------------------------------

I'd welcome any questions, suggestions or feedback.

Bob Murphy | Corelight, Inc. | bob.murphy at corelight.com | www.corelight.com

_______________________________________________
Zeek-Dev mailing list
Zeek-Dev@zeek.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/zeek-dev