Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-15 Thread Bob Murphy
> On Jul 15, 2020, at 1:09 AM, Robin Sommer  wrote:
> 
> On Thu, Jul 09, 2020 at 18:19 -0700, Bob Murphy wrote:
> 
>> Proposed Solution: Add a new optional API for writing a batch all at once,
>> while still supporting older log writers that don't need to write batches.
> 
> That sounds good to me, a PR with the proposed API would be great.

That sounds great. I wanted to bounce the ideas around with people who know 
more about Zeek than I do before going into detail on a proposed API.

> 
>> a. For non-batching log writers, change the "false" status to just mean
>>   "There was an error writing a log record". The log writing system will then
>>   report those failures to other Zeek components such as plug-ins, so they can
>>   monitor a log writer's health, and make more sophisticated decisions about
>>   whether a log writer can continue running or needs to be shut down.
> 
> Not quite sure what this would look like. Right now we just shut down
> the thread on error, right? Can you elaborate on what "report those
> failures to other Zeek components" and "make more sophisticated
> decisions" would look like?

Yes, right now, any writer error just shuts down the entire thread.

That’s a good solution for destinations like a disk, because if a write fails, 
something really bad has probably happened. But Seth Hall pointed out that some 
log destinations can recover, and it’s not a good solution for those.

Here are a couple of examples:

1. A writer might send log records to a network destination. If the connection 
is temporarily congested, it would start working again when the congestion 
clears.

2. The logs go to another computer that’s hung, and everything would work again 
if somebody rebooted it.

Seth's idea was to report the failures to a plugin that could be configured by 
an administrator. A plugin for a writer that goes to disk could shut down the 
writer on the first failure, like Zeek does now. And plugins for other writers 
could approach the examples above with a little more intelligence:

1. The plugin for the network destination writer could decide to shut down the 
writer only after no records have been successfully sent for a minimum of ten 
minutes.

2. The plugin for the remote-computer writer could alert an administrator to 
reboot the other computer. After that, the writer would successfully resume 
sending logs.
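
To make that concrete, here is a rough, hypothetical sketch of what such a
policy hook could look like. The WriteFailurePolicy interface, the
NetworkWriterPolicy class, and the ten-minute window are illustrative only,
not an existing or proposed Zeek API:

#include <chrono>

// Hypothetical interface a plugin could implement to decide whether a
// writer should keep running after a write failure.
class WriteFailurePolicy {
public:
    virtual ~WriteFailurePolicy() = default;

    // Called by the logging framework after each write attempt.
    // Returns true if the writer should be shut down.
    virtual bool ShouldShutDown(bool write_succeeded) = 0;
};

// Example 1: a network-destination policy that only gives up after ten
// minutes without a single successful write.
class NetworkWriterPolicy : public WriteFailurePolicy {
public:
    bool ShouldShutDown(bool write_succeeded) override {
        auto now = std::chrono::steady_clock::now();

        if ( write_succeeded ) {
            last_success = now;
            return false;
        }

        return now - last_success > std::chrono::minutes(10);
    }

private:
    std::chrono::steady_clock::time_point last_success =
        std::chrono::steady_clock::now();
};

// A disk-destination policy could simply return ! write_succeeded,
// reproducing Zeek's current shut-down-on-first-error behavior, and a
// remote-computer policy could additionally raise an alert for an
// administrator.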


> Could we just change the boolean result into a tri-state (1) all good;
> (2) recoverable error, and (3) fatal error? Here, (2) would mean that
> the writer failed with an individual write, but remains prepared to
> receive further messages for output. We could then also implicitly
> treat a current "false" as (3), so that existing writers wouldn't even
> notice the difference (at the source code level at least).

I don’t think that would work, because the member function in question returns 
a bool. To change that return value to represent more than two states, we’d 
have to do one of two things:

1. Change that bool to some other type.

If we did that, existing writers wouldn’t compile any more.

2. Use casts or a union to store and retrieve values other than 0 and 1 in that 
bool, and hope those values will be preserved across the function return and 
into the code that needs to analyze them.

We can’t count on values other than 0 or 1 being preserved, because a C++ bool 
is only meant to hold true or false, and what happens when you force other bit 
patterns into one is implementation-dependent. 
I wrote a test program using a pointer to store 0x0F into a bool, and other 
than looking at it in a debugger, everything I did to read the value out of 
that bool turned it into 0x01, including assigning it to another bool or an 
int. The only thing that saw 0x0F in there was taking a pointer to the bool, 
casting it to a pointer to char or uint8_t, and dereferencing that pointer.
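
For reference, here is a minimal sketch of that kind of test. The exact
behavior is implementation-dependent (which is the point), and poking values
into a bool's storage this way is not well-defined C++:

#include <cstdint>
#include <cstdio>

int main() {
    bool b = false;

    // Force a value other than 0 or 1 into the bool's storage byte. This is
    // only here to illustrate why smuggling extra states through a bool is
    // unreliable.
    *reinterpret_cast<uint8_t*>(&b) = 0x0F;

    // What these reads produce is up to the implementation; in the test
    // described above, they all came back as 1.
    bool copy = b;
    int as_int = b;
    std::printf("copy = %d, as_int = %d\n", static_cast<int>(copy), as_int);

    // Reading the raw storage byte directly still shows 0x0F.
    std::printf("raw byte = 0x%02X\n",
                static_cast<unsigned>(*reinterpret_cast<uint8_t*>(&b)));

    return 0;
}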


> 
>> b. Batching log writers will have a new API anyway, so that will let log
>>   writers report more detail about write failures, including suggestions about
>>   possible ways to recover.
> 
> Similar question here: what would these "suggestions" look like?


For batching, I was thinking of having a way to send back a std::vector of 
structs that would be something like this:

struct failure_info {
    uint32_t index_in_batch;
    uint16_t failure_type;
    uint16_t recovery_suggestion;
};

The values of failure_type would be an enumeration indicating things like 
“fatal, shut down the writer”, “log record exceeds protocol limit”, “unable to 
send packet”, “unable to write to disk”, etc. Using a fixed-size struct member 
that’s larger than the enum would allow extra values to be added in the future.
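
As an illustrative sketch, assuming hypothetical enumerator names rather than
a committed list:

#include <cstdint>

// Hypothetical enumerators, just to show the shape of the idea.
enum class FailureType : uint16_t {
    None = 0,
    FatalShutDownWriter,        // "fatal, shut down the writer"
    RecordExceedsProtocolLimit, // "log record exceeds protocol limit"
    UnableToSendPacket,         // "unable to send packet"
    UnableToWriteToDisk,        // "unable to write to disk"
};

// failure_info::failure_type stays a plain uint16_t rather than FailureType
// itself, so new enumerators can be added later without changing the
// struct's size or layout.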

recovery_suggestion would be a similar enum-in-larger-type, and let the writer 
convey more information, based on what it knows about the log destination. That 
could indicate things like, “the network connection has entirely dropped and no 
recovery is possible”, “the network connection is busy, try again later”, “this 
log record is 

Re: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe

2020-07-15 Thread Bob Murphy
> 
> On Jul 15, 2020, at 12:52 AM, Robin Sommer  wrote:
> Reading through this thread, I'm wondering if we should focus on
> improving identification of log lines in terms of where they come from
> and when they were generated, while continuing to go through the existing
> mechanism of sending messages back to the main process for output (so that
> we don't need the mutex). If we sent timestamps & thread IDs along
> with the Debug() messages, one could later post-process debug.log to
> get things sorted/split as desired.
> 
> This wouldn't support the use case of "millions of lines" very well,
> but I'm not convinced that's what we should be designing this for. A
> mutex becomes potentially problematic at that volume as well, and it
> also seems like a rare use case to begin with. In cases where it's
> really needed, a local patch to get logs into files directly (as you
> have done already) might just do the trick, no?
> 
> Robin

We could definitely change DebugLogger to improve the log line identification, 
and route it through the threading framework’s Debug() call. That will avoid 
turning debug.log into "word salad".

However, that would also cause a delay in writing the log lines, and I've run 
into situations working on Zeek where that kind of delay would make debugging 
harder.

For example, sometimes I run tail on the log file in a terminal window. Then, 
when the code hits a breakpoint in a debugger, I can analyze the program state 
by looking at log lines emitted right before the breakpoint triggers, and 
compare them to variable contents, the stack trace, etc. That won't work if 
logging is delayed.

There are multiple, conflicting use cases for logging in Zeek. Sometimes a 
developer might think:
- Maximized throughput is important, but a delay is okay
- No delay can be tolerated, but slower throughput is okay
- Correct temporal ordering in the log is (or isn’t) important
- fflush() after every write is (or isn’t) important
- Debug logging output should go to the debug.log file, or stdout, or somewhere 
else

This is a pretty common situation around logging, in my experience.

One way to solve it, as Robin says, is for a developer with a use case Zeek 
doesn't support to apply a temporary local patch. Unfortunately, that doesn't 
help other developers who might have the same use case. Also, I personally hate 
to spend time writing code and getting it to work well, and then throw it away.

On other projects, I've used a different approach that's worked really well: 
use a single, common logging API, but let it send its output to different 
output mechanisms that support different use cases. Then a developer could pick 
the output mechanism that works best for their use case at runtime, using a 
command line option or environment variable. I think it wouldn’t be very 
complicated to add that to Zeek.
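
As a minimal sketch of the kind of thing I mean, with hypothetical names
(DebugSink, ZEEK_DEBUG_LOG_SINK, and so on) rather than any existing Zeek
interface:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical sink interface: one debug-logging API, several output
// mechanisms behind it.
class DebugSink {
public:
    virtual ~DebugSink() = default;
    virtual void Write(const std::string& line) = 0;
};

// Unbuffered output to stderr: no delay, lower throughput.
class StderrSink : public DebugSink {
public:
    void Write(const std::string& line) override {
        std::fprintf(stderr, "%s\n", line.c_str());
    }
};

// Output to a file, flushed on every write so "tail -f" keeps up with a
// debugger session.
class FileSink : public DebugSink {
public:
    explicit FileSink(const char* path) : f(std::fopen(path, "a")) {}
    ~FileSink() override { if ( f ) std::fclose(f); }

    void Write(const std::string& line) override {
        if ( ! f )
            return;
        std::fprintf(f, "%s\n", line.c_str());
        std::fflush(f);
    }

private:
    std::FILE* f;
};

// Pick the output mechanism at runtime from an environment variable.
std::unique_ptr<DebugSink> MakeDebugSink() {
    const char* choice = std::getenv("ZEEK_DEBUG_LOG_SINK");

    if ( choice && std::strcmp(choice, "stderr") == 0 )
        return std::make_unique<StderrSink>();

    return std::make_unique<FileSink>("debug.log");
}

Whatever the actual mechanism, the point is that the choice is made once at
startup, and the rest of the code just calls the same logging API.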

- Bob




Re: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

2020-07-15 Thread Robin Sommer



On Thu, Jul 09, 2020 at 18:19 -0700, Bob Murphy wrote:

> Proposed Solution: Add a new optional API for writing a batch all at once,
> while still supporting older log writers that don't need to write batches.

That sounds good to me, a PR with the proposed API would be great.

> a. For non-batching log writers, change the "false" status to just mean
>    "There was an error writing a log record". The log writing system will then
>    report those failures to other Zeek components such as plug-ins, so they can
>    monitor a log writer's health, and make more sophisticated decisions about
>    whether a log writer can continue running or needs to be shut down.

Not quite sure what this would look like. Right now we just shut down
the thread on error, right? Can you elaborate on what "report those
failures to other Zeek components" and "make more sophisticated
decisions" would look like?

Could we just change the boolean result into a tri-state (1) all good;
(2) recoverable error, and (3) fatal error? Here, (2) would mean that
the writer failed with an individual write, but remains prepared to
receive further messages for output. We could then also implicitly
treat a current "false" as (3), so that existing writers wouldn't even
notice the difference (at the source code level at least).

> b. Batching log writers will have a new API anyway, so that will let log
>    writers report more detail about write failures, including suggestions about
>    possible ways to recover.

Similar question here: what would these "suggestions" look like?

Robin

-- 
Robin Sommer * Corelight, Inc. * ro...@corelight.com * www.corelight.com


Re: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe

2020-07-15 Thread Robin Sommer
Reading through this thread, I'm wondering if we should focus on
improving identification of log lines in terms of where they come from
and when they were generated, while continuing to go through the existing
mechanism of sending messages back to the main process for output (so that
we don't need the mutex). If we sent timestamps & thread IDs along
with the Debug() messages, one could later post-process debug.log to
get things sorted/split as desired.

This wouldn't support the use case of "millions of lines" very well,
but I'm not convinced that's what we should be designing this for. A
mutex becomes potentially problematic at that volume as well, and it
also seems like a rare use case to begin with. In cases where it's
really needed, a local patch to get logs into files directly (as you
have done already) might just do the trick, no?

Robin

On Tue, Jul 14, 2020 at 14:58 -0700, Bob Murphy wrote:

> 
> > On Jul 14, 2020, at 1:14 PM, Jon Siwek  wrote:
> > 
> > On Tue, Jul 14, 2020 at 11:56 AM Bob Murphy  wrote:
> > 
> >> The code you show should give correct ordering on when Foo() and Bar() 
> >> finish.
> > 
> > Wondering what's meant by "correct ordering" here.  Bar() can finish
> > before Foo() and yet debug.log can report "I did Foo" before "I did
> > Bar" for whatever thread-scheduling reasons happened to make that the
> > case.  Or Foo() and Bar() can execute together in complete concurrency
> > and it's just the LockedDebugMsg() picking an arbitrary "winner".
> > 
> > - Jon
> 
> I see your point.
> 
> For example:
> a. Foo() in thread 1 finishes before Bar() in thread 2 finishes
> b. The scheduler deactivates thread 1 for a while between the return from 
> Foo() and the execution of LockedDebugMsg("I did Foo.”)
> c. Thread 2 proceeds from the return from Bar() without interruption
> 
> Then debug.log would contain the message “I did Bar” before “I did Foo”.
> 
> So the ordering in the log file really reflects how the kernel sees the 
> temporal order of mutex locking inside LockedDebugMsg. That’s an inexact 
> approximation of the temporal order of calls to LockedDebugMsg, and that’s an 
> even more inexact approximation of the temporal order of code executed before 
> LockedDebugMsg.
> 
> For what I was doing, though, that proved to be good enough. :-)
> 
> I’d be very interested in ideas about how to improve that, especially if 
> they’re simple. I can think of a way to improve it, but it would be 
> substantially more complicated than just a mutex.
> 
> 
> 


-- 
Robin Sommer * Corelight, Inc. * ro...@corelight.com * www.corelight.com