Good morning everyone.

I'm researching compression of Zeek data.  I'm currently dumping Zeek data
into Parquet files, and one of the most challenging fields to compress is
uid because of its high entropy.

I'm wondering if there's any interest in changing the format of the uid to
something like ULID <https://github.com/ulid/spec>, for which a C++
implementation <https://github.com/suyash/ulid> already exists.

A ULID-based uid implementation would:

   - allow uids to be sorted, which isn't helpful in and of itself but is
   very helpful for compression
   - still be URL-safe
   - always be 26 characters, for simpler storage
   - be case-insensitive
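
To make the properties above concrete, here is a minimal Python sketch of
the ULID text encoding as I read the spec: a 48-bit millisecond timestamp
followed by 80 random bits, written as 26 Crockford base32 characters. The
function names (encode_ulid, new_ulid) are hypothetical; this is not the
suyash/ulid API or anything in Zeek's UID code, and it only illustrates why
the strings sort by creation time.

```python
import os
import time

# Crockford base32 alphabet used by the ULID spec (no I, L, O, U).
# The characters are in ascending ASCII order, so string order matches
# numeric order for fixed-width values.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Encode a 48-bit millisecond timestamp and 80 random bits as 26 chars.

    Hypothetical helper for illustration only.
    """
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    # Timestamp goes in the most significant bits, randomness below it.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    # 26 characters * 5 bits = 130 bits; the top 2 bits are always zero.
    return "".join(
        CROCKFORD[(value >> shift) & 0x1F] for shift in range(125, -1, -5)
    )


def new_ulid() -> str:
    """Generate an identifier from the current time and OS randomness."""
    return encode_ulid(int(time.time() * 1000), os.urandom(10))


if __name__ == "__main__":
    earlier = encode_ulid(1_000, os.urandom(10))
    later = encode_ulid(2_000, os.urandom(10))
    print(earlier, later, earlier < later)
```

Because the timestamp occupies the leading characters, uids created close
together share a long common prefix once sorted, which is the property I
expect to help columnar compression.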


Looking through the code (UID.h
<https://github.com/bro/bro/blob/master/src/UID.h> and UID.cc
<https://github.com/bro/bro/blob/master/src/UID.cc>) and its usages, the
change doesn't look technically difficult, but I'm sure I'm missing some
reasons not to make it.
For example, I noticed that prefixes such as the letter 'C' are used to
denote kinds of connections.  Perhaps that data can be extracted to another
field instead?

Anyway, I'm looking for thoughts, comments, suggestions, and anything else.
Thank you!

-- 
Karl
_______________________________________________
zeek-dev mailing list
[email protected]
http://mailman.icsi.berkeley.edu/mailman/listinfo/zeek-dev
