Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-17 Thread Robin Sommer


On Thu, Jul 16, 2020 at 17:15 -0700, Bob Murphy wrote:

> Here’s how it would work:

It would be helpful to see a draft API for the full batch-writing
functionality, to understand how the pieces would work together. Could
you mock that up?

That said, a couple of thoughts:

> 2. The failure_type value would still indicate generally what
> happened, with predefined values indicating things like “network
> failure”, “protocol error”, “unable to write to disk”, or
> “unspecified failure".

In my experience, such detailed numerical error codes are rarely
useful in practice. Different writers will implement them to different
degrees and associate different semantics with them, and callers will
never quite know what to expect and how to react.

Do you actually need to distinguish the semantics for all these
different cases? Seems an alternative would be having a small set of
possible "impact" values telling the caller what to do. To take a
stab:

- temporary error: failed, but should try again with the same log data
- error: failed, and trying the same log data again won't help, but it's
  ok to continue with new log data
- fatal error: panic, shut down the writer

Depending on who's going to log failures, we could also include a
textual error message. Logging is where more context seems most useful,
I'd say.
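
To make that concrete, here's a minimal sketch of what such a result could
look like (names are hypothetical, just to illustrate the idea, not a
proposed API):

#include <string>

// Sketch only: a small set of "impact" values plus optional context.
enum class WriteImpact {
    TemporaryError,  // failed, but retrying the same log data may succeed
    PermanentError,  // failed, retrying won't help; new log data is still fine
    FatalError       // writer should be shut down
};

struct WriteFailure {
    WriteImpact impact;
    std::string message;  // optional textual context, mainly for logging
};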

> 3. first_index and index_count would specify a range. That way, if
> several successive log records aren’t sent for the same reason, that
> could be represented by a single struct, instead of a different struct
> for each one.

One reason I'm asking about the full API is that I'm not sure where
ownership of logs that fail to write resides. Is the writer keeping
them? If so, it could handle the retry case internally. If the writer
discards them after a failure and the caller needs to send the data
again, I'd wonder if there's a simpler return type here where we just
point to the first failed entry in the batch. The writer would simply
abort on the first failure (how likely is it really that the next write
succeeds immediately afterwards?).

And just to be clear why I'm making all these comments: I'm worried
about the difficulty of using this API, on both ends. The more complex
we make the things being passed around, the more difficult it gets to
implement the logic correctly and efficiently.

Robin

-- 
Robin Sommer * Corelight, Inc. * ro...@corelight.com * www.corelight.com


Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-16 Thread Bob Murphy
>> For batching, I was thinking of having a way to send back a std::vector of 
>> structs that would be something like this:
>> 
>> struct failure_info {
>>uint32_t index_in_batch;
>>uint16_t failure_type;
>>uint16_t recovery_suggestion;
>> };
> 
> This is almost starting to sound a bit more complicated than is worth it.  We 
> may need to discuss this a bit more to figure out something simpler.  The 
> immediate problem that springs to mind is that as a developer, I don't think 
> I'd have any clue what failure_types and recovery_suggestions could be common 
> among export destinations.

Seth and I were talking today, and came up with something like this:
struct failure_info {
    uint32_t first_index;
    uint16_t index_count;
    uint16_t failure_type;
};

Here’s how it would work:

1. The batch writing function would return a std::vector of these. If the 
entire batch wrote successfully, the vector would be empty.

2. The failure_type value would still indicate generally what happened, with 
predefined values indicating things like “network failure”, “protocol error”, 
“unable to write to disk”, or “unspecified failure”. Seth thought we’d be 
likely to start out with about ten values like this. Using a 16-bit value for 
this leaves lots of room for expansion :-) and maintains reasonable alignment 
within the struct.

3. first_index and index_count would specify a range. That way, if several 
successive log records aren’t sent for the same reason, that could be 
represented by a single struct, instead of a different struct for each one.

This drops the recovery suggestion.

The sizes of the struct fields are currently set to pack nicely into eight 
bytes, with no wasted space either within the struct or between structs in an 
array. We could make the fields different sizes, though.
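
For illustration, the batch call itself might look roughly like this (a
sketch only; LogRecord is a placeholder type, not a proposed final
signature, and failure_info is the struct above):

#include <vector>

// Hypothetical sketch. An empty result vector means the whole batch was
// written successfully.
class BatchWriterBackend {
public:
    virtual std::vector<failure_info> DoWriteBatch(
        const std::vector<LogRecord>& records) = 0;
};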


Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-16 Thread Seth Hall


On 15 Jul 2020, at 20:45, Bob Murphy wrote:

>> On Jul 15, 2020, at 1:09 AM, Robin Sommer  
>> wrote:
>>
>> Not quite sure what this would look like. Right now we just shut down
>> the thread on error, right? Can you elaborate how "report those
>> failures to other Zeek components" and "make more sophisticated
>> decisions" would look like?
>
> Yes, right now, any writer error just shuts down the entire thread.
>
> That’s a good solution for destinations like a disk, because if a 
> write fails, something really bad has probably happened. But Seth Hall 
> pointed out that some log destinations can recover, and it’s not a 
> good solution for those.

More or less this is the same sort of thing I'm always pushing for: 
moving more functionality into scripts.  If I got an event in scriptland, 
I might be able to determine in the script what action to take and 
whether to shut down the writer or let it keep going.

> For batching, I was thinking of having a way to send back a 
> std::vector of structs that would be something like this:
>
> struct failure_info {
> uint32_t index_in_batch;
> uint16_t failure_type;
> uint16_t recovery_suggestion;
> };

This is almost starting to sound a bit more complicated than is worth 
it.  We may need to discuss this a bit more to figure out something 
simpler.  The immediate problem that springs to mind is that as a 
developer, I don't think I'd have any clue what failure_types and 
recovery_suggestions could be common among export destinations.

   .Seth

--
Seth Hall * Corelight, Inc * www.corelight.com


Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-15 Thread Bob Murphy
> On Jul 15, 2020, at 1:09 AM, Robin Sommer  wrote:
> 
> On Thu, Jul 09, 2020 at 18:19 -0700, Bob Murphy wrote:
> 
>> Proposed Solution: Add a new optional API for writing a batch all at once, 
>> while
>> still supporting older log writers that don't need to write batches.
> 
> That sounds good to me, a PR with the proposed API would be great.

That sounds great. I wanted to bounce the ideas around with people who know 
more about Zeek than I do before going into detail on a proposed API.

> 
>> a. For non-batching log writers, change the "false" status to just mean
>>   "There was an error writing a log record". The log writing system will then
>>   report those failures to other Zeek components such as plug-ins, so they 
>> can
>>   monitor a log writer's health, and make more sophisticated decisions about
>>   whether a log writer can continue running or needs to be shut down.
> 
> Not quite sure what this would look like. Right now we just shut down
> the thread on error, right? Can you elaborate how "report those
> failures to other Zeek components" and "make more sophisticated
> decisions" would look like?

Yes, right now, any writer error just shuts down the entire thread.

That’s a good solution for destinations like a disk, because if a write fails, 
something really bad has probably happened. But Seth Hall pointed out that some 
log destinations can recover, and it’s not a good solution for those.

Here are a couple of examples:

1. A writer might send log records to a network destination. If the connection 
is temporarily congested, it would start working again when the congestion 
clears.

2. The logs go to another computer that’s hung, and everything would work again 
if somebody rebooted it.

Seth's idea was to report the failures to a plugin that could be configured by 
an administrator. A plugin for a writer that goes to disk could shut down the 
writer on the first failure, like Zeek does now. And plugins for other writers 
could approach the examples above with a little more intelligence:

1. The plugin for the network destination writer could decide to shut down the 
writer only after no records have been successfully sent for a minimum of ten 
minutes.

2. The plugin for the remote-computer writer could alert an administrator to 
reboot the other computer. After that, the writer would successfully resume 
sending logs.
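
Purely as a sketch of the first example above, the plugin's decision logic
could be as simple as tracking the time of the last successful write
(hypothetical class, not an existing Zeek interface):

#include <chrono>

// Hypothetical policy: shut the writer down only after ten minutes
// without a single successful write.
class NetworkWriterPolicy {
public:
    void OnWriteSucceeded()
        { last_success_ = std::chrono::steady_clock::now(); }

    bool ShouldShutDown() const
        { return std::chrono::steady_clock::now() - last_success_
                 > std::chrono::minutes(10); }

private:
    std::chrono::steady_clock::time_point last_success_ =
        std::chrono::steady_clock::now();
};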


> Could we just change the boolean result into a tri-state (1) all good;
> (2) recoverable error, and (3) fatal error? Here, (2) would mean that
> the writer failed with an individual write, but remains prepared to
> receive further messages for output. We could the also implicitly
> treat a current "false" as (3), so that existing writers wouldn't even
> notice the difference (at the source code level at least).

I don’t think that would work, because the member function in question returns 
a bool. To change that return value to represent more than two states, we’d 
have to do one of two things:

1. Change that bool to some other type.

If we did that, existing writers wouldn’t compile any more.

2. Use casts or a union to store and retrieve values other than 0 and 1 in that 
bool, and hope those values will be preserved across the function return and 
into the code that needs to analyze them.

We can’t count on values other than 0 or 1 being preserved, because the bool 
type in C++ is a little weird, and some behaviors are implementation-dependent. 
I wrote a test program using a pointer to store 0x0F into a bool, and other 
than looking at it in a debugger, everything I did to read the value out of 
that bool turned it into 0x01, including assigning it to another bool or an 
int. The only thing that saw 0x0F in there was taking a pointer to the bool, 
casting it to a pointer to char or uint8_t, and dereferencing that pointer.
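
For reference, option 1 might look something like this (a sketch only; the
enum name is made up), which is exactly why existing writers that override
the bool-returning DoWrite wouldn't compile any more:

// Sketch of a tri-state status that could replace the current bool.
enum class WriteStatus {
    Success,      // record written
    Recoverable,  // record not written, but the writer can keep running
    Fatal         // writer must be shut down
};

// Today:         virtual bool        DoWrite(...) = 0;
// With option 1: virtual WriteStatus DoWrite(...) = 0;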


> 
>> b. Batching log writers will have a new API anyway, so that will let log
>>   writers report more detail about write failures, including suggestions 
>> about
>>   possible ways to recover.
> 
> Similar question here: how would these "suggestions" look like?


For batching, I was thinking of having a way to send back a std::vector of 
structs that would be something like this:

struct failure_info {
    uint32_t index_in_batch;
    uint16_t failure_type;
    uint16_t recovery_suggestion;
};

The values of failure_type would be an enumeration indicating things like 
“fatal, shut down the writer”, “log record exceeds protocol limit”, “unable to 
send packet”, “unable to write to disk”, etc. Using a fixed-size struct member 
that’s larger than the enum would allow extra values to be added in the future.

recovery_suggestion would be a similar enum-in-larger-type, and let the writer 
convey more information, based on what it knows about the log destination. That 
could indicate things like, “the network connection has entirely dropped and no 
recovery is possible”, “the network connection is busy, try again later”, “this 
log record is 

Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-15 Thread Robin Sommer



On Thu, Jul 09, 2020 at 18:19 -0700, Bob Murphy wrote:

> Proposed Solution: Add a new optional API for writing a batch all at once, 
> while
> still supporting older log writers that don't need to write batches.

That sounds good to me, a PR with the proposed API would be great.

> a. For non-batching log writers, change the "false" status to just mean
>"There was an error writing a log record". The log writing system will then
>report those failures to other Zeek components such as plug-ins, so they 
> can
>monitor a log writer's health, and make more sophisticated decisions about
>whether a log writer can continue running or needs to be shut down.

Not quite sure what this would look like. Right now we just shut down
the thread on error, right? Can you elaborate how "report those
failures to other Zeek components" and "make more sophisticated
decisions" would look like?

Could we just change the boolean result into a tri-state (1) all good;
(2) recoverable error, and (3) fatal error? Here, (2) would mean that
the writer failed with an individual write, but remains prepared to
receive further messages for output. We could then also implicitly
treat a current "false" as (3), so that existing writers wouldn't even
notice the difference (at the source code level at least).

> b. Batching log writers will have a new API anyway, so that will let log
>writers report more detail about write failures, including suggestions 
> about
>possible ways to recover.

Similar question here: how would these "suggestions" look like?

Robin

-- 
Robin Sommer * Corelight, Inc. * ro...@corelight.com * www.corelight.com


[Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-09 Thread Bob Murphy
Summary

This proposal is aimed at solving two intertwined problems in Zeek's log-
writing system:


Problem: Batch writing code duplication
- Some log writers need to send multiple log records at a time in "batches".
  These include writers that send data to Elasticsearch, Splunk HEC, Kinesis,
  and various HTTP-based destinations.
- Right now, each of these log writers has similar or identical code to create
  and manage batches.
- This code duplication makes writing and maintaining "batching" log writers
  harder and more bug-prone.

Proposed Solution: Add a new optional API for writing a batch all at once, while
still supporting older log writers that don't need to write batches.


Problem: Insufficient information about failures
- Different log writers can fail in a variety of ways.
- Some of these failure modes are amenable to automatic recovery within Zeek,
  and others could be corrected by an administrator if they knew about it.
- However, the current system for writing log records returns a boolean
  indicating only two log writer statuses: "true" means "Everything's fine!",
  and "false" means "Emergency!!! The log writer needs to be shut down!"

Proposed Solution:
a. For non-batching log writers, change the "false" status to just mean
   "There was an error writing a log record". The log writing system will then
   report those failures to other Zeek components such as plug-ins, so they can
   monitor a log writer's health, and make more sophisticated decisions about
   whether a log writer can continue running or needs to be shut down.
b. Batching log writers will have a new API anyway, so that will let log
   writers report more detail about write failures, including suggestions about
   possible ways to recover.



Design Details


Current Implementation

At present, log writers are C++ classes that descend from the abstract
WriterBackend superclass. Each log writer must override several pure virtual
member functions, which include:
* DoInit: Writer-specific initialization method.
* DoWrite: Write one log record.
  Returns a boolean, where true means "everything's fine", and false means
  "things are so bad, the log writer needs to be shut down."

Log writers can also optionally override this virtual member function:
* DoWriteLogs: Optionally writer-specific output method that records zero or
  more log entries. The default implementation in the superclass simply
  calls DoWrite() in a loop.
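
For orientation, a heavily simplified sketch of that shape (this is not the
actual Zeek declaration; parameters are elided):

// Simplified sketch -- not the real signatures.
class WriterBackend {
public:
    virtual bool DoInit(/* ... */) = 0;   // writer-specific initialization
    virtual bool DoWrite(/* ... */) = 0;  // false currently forces a shutdown

protected:
    // Default implementation calls DoWrite() once per record.
    virtual bool DoWriteLogs(/* ... */);
};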


New Implementation

This has two main goals:
* Provide a new base class for log writers that supports writing a batch of
  records at once, handles all the batch creation and write logic, and offers
  more sophisticated per-record reporting on failures.
* Provide backward compatibility so "legacy" (existing, non-batching) log
  writers can build and run without code changes, while changing the meaning of
  "false" when returned from DoWrite() to "sending this one log record failed."

These goals will be achieved using three writer backend classes:

1. BaseWriterBackend

This will be an abstract base class, serving as the superclass for both
legacy and batching log writers.
- It will have the same API signature as the existing WriterBackend, except it
  will omit DoWrite().
- It will also expose the existing DoWriteLogs() member function as a pure
  virtual function, so there's a standard interface for WriterBackend::Write()
  to call.

2. WriterBackend

This class will derive from BaseWriterBackend, and will support legacy log
writers as a drop-in replacement for the existing WriterBackend class.
- It will add a pure virtual DoWrite member function to BaseWriterBackend, so
  its API signature will be identical to the existing WriterBackend class. That
  will let legacy log writers inherit from it with no code changes, and also
  support new log writers that don't need batching.
- The return semantics for DoWrite will change: when it returns false, that
  will simply mean the argument record wasn't successfully written.
- Its specialization of DoWriteLogs will be nearly identical to Zeek's current
  implementation, except that when DoWrite returns false, DoWriteLogs will
  simply report the failure to the rest of Zeek, rather than triggering a log
  writer shutdown. Then, other Zeek components can monitor the writer's health
  and decide whether to shut down the log writer or let it continue.

3. BatchWriterBackend

This class will derive from BaseWriterBackend, and will write logs in batches.
- Instead of DoWrite, it will expose a DoWriteBatch pure virtual member function
  to accept logs in batches.
- Its specialization of DoWriteLogs will call DoWriteBatch.
- It will support configuring per-log-writer criteria that trigger flushing a
  batch, including:
* Maximum age of the oldest cached log (default value TBD)
* Maximum number of cached log records (default value TBD)
- DoWriteBatch will support
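
A rough sketch of how the three classes might relate (illustrative only;
parameter lists and return types are placeholders, and failure_info is the
struct sketched elsewhere in this thread):

#include <vector>

// Common base: same API as today's WriterBackend, minus DoWrite().
class BaseWriterBackend {
public:
    virtual bool DoInit(/* ... */) = 0;
    // Standard entry point for the logging framework to call.
    virtual bool DoWriteLogs(/* zero or more records */) = 0;
};

// Drop-in replacement for the existing WriterBackend. Legacy writers keep
// overriding DoWrite(); false now just means "this one record failed".
class WriterBackend : public BaseWriterBackend {
public:
    virtual bool DoWrite(/* one record */) = 0;
    bool DoWriteLogs(/* records */) override;  // loops over DoWrite()
};

// Batching writers override DoWriteBatch() instead of DoWrite().
class BatchWriterBackend : public BaseWriterBackend {
public:
    virtual std::vector<failure_info> DoWriteBatch(/* batch of records */) = 0;
    bool DoWriteLogs(/* records */) override;  // flushes via DoWriteBatch()
};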