Database audit analytics - Kudu vs. Parquet

Ehud Eshet Mon, 31 Jul 2017 09:17:36 -0700

Hi all,
High level use case description
Database Activity Monitoring captures audit records from various relational and 
NoSQL databases.
Big customers may have thousands of different monitored databases that may 
generate (together) billions to trillion audit records per day.
Audit records are loaded once, queried many times, and never being updated.
Customers may define one audit policy to audit all DB activity or multiple 
audit policies on which each will audit a sub-set of DB activity.
Audit records are highly structured. Each record contains:


*        Event time stamp and sequence id.

*        Database identifiers such as: Server IP, Listener port, Instance Name, 
DB Name, DBMS type, DB account.

*        Client identifiers such as: Client IP, Client port, OS User, Host, 
Application, Connection ID.

*        Operation identifiers such as: Audit policy, Operation, Query text, 
Bind variables values, Object schema, Object name, Object type, Error code.

*        Counters such as: Response time, Response size, Affected rows.
The actual structure contains more than 50 columns. Most of them are string 
identifiers (dimensions).
Audit analytics using Spark SQL with Parquet
Parquet columnar encoding and compression provides a compression ratio of 1 to 
100 on my typical data.
Encoding is selected automatically per column by the parquet loader based on 
the actual data loaded into the current 128MB rows set.
Multiple encodings are supported for same column. E.g. Dictionary and then RLE 
which are best together when a column has both small amount of long strings and 
repeating values.
Spark Parquet integration allows each Spark worker node to read Parquet file 
blocks that are stored on the local HDFS node.
No range partitions nor hash partitions are supported.
Multi-level list partitions are supported. Adding a partition is just a matter 
of creating additional HDFS sub-directory with the appropriate naming 
conventions.
The main pain point with Parquet files is that you cannot update a file after 
closing it.
To append records, you can only add a new Parquet file on the same HDFS 
sub-directory.
As long as Parquet file is still open, it is not available for reporting.
Parquet files are not necessarily sorted. Even when they are, Spark SQL will 
not take advantage on sort order while filtering by sorted leading columns.
Audit analytics using Spark SQL and Kudu
Kudu solve the main pain point of static Parquet files.
However, it introduces other problems:

1.      No support for combination of RLE and dictionary encoding on same 
String columns (actually RLE is not supported even alone for string columns).
Same typical data suffer from lower compression ratio (1 to 14).

2.      Incomplete Spark SQL integration

a.      Tables can be created via Impala but not via Spark SQL.

b.      Spark SQL can append rows into Kudu table.
However, I did not find documentation for how to tell Spark SQL to add range 
partitions when needed.

c.      Predicates are pushed by Spark SQL down to Kudu.
However, aggregations are not. All qualified rows are returned to Spark SQL 
instead of each tablet will return a partially aggregated (and much smaller) 
rows set.

3.      Primary key is mandatory even when rows will never be updated or 
deleted (historical data is purged by dropping entire range partition).
Apache Kudu wish list

*        List partitions - all audit reports must filter on audit policy (equal 
or IN list predicate).
Currently, I can use hash partitioning on policy.
However, number of buckets cannot be changed after table creation.
I would like any new policy to automatically create a new partition.

*        Interval based range partitions - let me just define the interval in 
micro seconds (86,400,000,000 for daily partitions).
Then Kudu could add partitions automatically when rows for a new day arrive.

*        Automatic compression of dictionary pages when string length > 100

*        Allow RLE encoding of the indexes in the dictionary of string columns.

*        Allow delta encoding for sequences and timestamp columns.

*        Aggregation push down.


[https://signature.imperva.com/assets/imperva-logo.png]
Ehud Eshet | Senior Researcher
ehud.es...@imperva.com<mailto:ehud.es...@imperva.com> | o: +972 3-684-0114 | m: 
+972 52-446-1979

Database audit analytics - Kudu vs. Parquet

Reply via email to