I'd say the tooling is still Java-focused, but I found some decent CLI tooling at https://github.com/apache/parquet-mr/tree/master/parquet-tools
Specifically, I used the convert command
<https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ConvertCommand.java>
to go from JSON -> Parquet. Converting JSON.gz to Parquet (gzip compression codec) saved us about 35%.

When you say "log writer", do you mean a custom Zeek writer
<https://docs.zeek.org/en/stable/frameworks/logging.html> that writes to Parquet directly?

The major issue we're facing is that the schema for Zeek output can change over time (more columns can be added). That's an issue for Parquet.

On Fri, Aug 30, 2019 at 2:21 PM Justin Azoff <[email protected]> wrote:

> On Fri, Aug 30, 2019 at 2:17 PM Karl Pietrzak <[email protected]> wrote:
>
>> Good morning everyone.
>>
>> I'm researching compression of Zeek data. I'm currently dumping Zeek
>> data into Parquet files
>>
>
> I don't have much feedback on the uid bits, but I'm very interested in
> Parquet! I had looked into doing this a while back but the tooling around
> parquet was very java/big data focussed and not very CLI friendly. Are you
> using the new c++ implementation in a log writer or are you converting
> json to parquet?
>
> --
> Justin
>

--
Karl
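
On the schema-drift point above, here is a rough sketch of one way to cope before conversion: collect the union of columns seen across all JSON records and fill the missing ones with nulls, so every row has the same shape. This is only an illustration, not what the convert command does. The helper names (union_columns, normalize) and the sample records are made up, and it uses nothing beyond the Python standard library.

```python
import json


def union_columns(records):
    """Return every field name seen across the records, in first-seen order."""
    columns = []
    seen = set()
    for record in records:
        for name in record:
            if name not in seen:
                seen.add(name)
                columns.append(name)
    return columns


def normalize(records, columns):
    """Give every record the same columns, using None where a field is absent."""
    return [{name: record.get(name) for name in columns} for record in records]


# Hypothetical sample lines: the second one has an extra column that
# appeared after the first was written.
old_lines = ['{"ts": 1567189260.0, "uid": "C1", "id.orig_h": "10.0.0.1"}']
new_lines = [
    '{"ts": 1567189320.0, "uid": "C2", "id.orig_h": "10.0.0.2", "community_id": "1:abc"}'
]

records = [json.loads(line) for line in old_lines + new_lines]
columns = union_columns(records)
rows = normalize(records, columns)
```

The normalized rows could then be handed to whatever Parquet writer is in use. Whether a given Parquet reader can merge differing schemas across files on its own depends on the reader, so I would not assume it.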
