Hi all, High level use case description Database Activity Monitoring captures audit records from various relational and NoSQL databases. Big customers may have thousands of different monitored databases that may generate (together) billions to trillion audit records per day. Audit records are loaded once, queried many times, and never being updated. Customers may define one audit policy to audit all DB activity or multiple audit policies on which each will audit a sub-set of DB activity. Audit records are highly structured. Each record contains:
* Event time stamp and sequence id. * Database identifiers such as: Server IP, Listener port, Instance Name, DB Name, DBMS type, DB account. * Client identifiers such as: Client IP, Client port, OS User, Host, Application, Connection ID. * Operation identifiers such as: Audit policy, Operation, Query text, Bind variables values, Object schema, Object name, Object type, Error code. * Counters such as: Response time, Response size, Affected rows. The actual structure contains more than 50 columns. Most of them are string identifiers (dimensions). Audit analytics using Spark SQL with Parquet Parquet columnar encoding and compression provides a compression ratio of 1 to 100 on my typical data. Encoding is selected automatically per column by the parquet loader based on the actual data loaded into the current 128MB rows set. Multiple encodings are supported for same column. E.g. Dictionary and then RLE which are best together when a column has both small amount of long strings and repeating values. Spark Parquet integration allows each Spark worker node to read Parquet file blocks that are stored on the local HDFS node. No range partitions nor hash partitions are supported. Multi-level list partitions are supported. Adding a partition is just a matter of creating additional HDFS sub-directory with the appropriate naming conventions. The main pain point with Parquet files is that you cannot update a file after closing it. To append records, you can only add a new Parquet file on the same HDFS sub-directory. As long as Parquet file is still open, it is not available for reporting. Parquet files are not necessarily sorted. Even when they are, Spark SQL will not take advantage on sort order while filtering by sorted leading columns. Audit analytics using Spark SQL and Kudu Kudu solve the main pain point of static Parquet files. However, it introduces other problems: 1. No support for combination of RLE and dictionary encoding on same String columns (actually RLE is not supported even alone for string columns). Same typical data suffer from lower compression ratio (1 to 14). 2. Incomplete Spark SQL integration a. Tables can be created via Impala but not via Spark SQL. b. Spark SQL can append rows into Kudu table. However, I did not find documentation for how to tell Spark SQL to add range partitions when needed. c. Predicates are pushed by Spark SQL down to Kudu. However, aggregations are not. All qualified rows are returned to Spark SQL instead of each tablet will return a partially aggregated (and much smaller) rows set. 3. Primary key is mandatory even when rows will never be updated or deleted (historical data is purged by dropping entire range partition). Apache Kudu wish list * List partitions - all audit reports must filter on audit policy (equal or IN list predicate). Currently, I can use hash partitioning on policy. However, number of buckets cannot be changed after table creation. I would like any new policy to automatically create a new partition. * Interval based range partitions - let me just define the interval in micro seconds (86,400,000,000 for daily partitions). Then Kudu could add partitions automatically when rows for a new day arrive. * Automatic compression of dictionary pages when string length > 100 * Allow RLE encoding of the indexes in the dictionary of string columns. * Allow delta encoding for sequences and timestamp columns. * Aggregation push down. [https://signature.imperva.com/assets/imperva-logo.png] Ehud Eshet | Senior Researcher ehud.es...@imperva.com<mailto:ehud.es...@imperva.com> | o: +972 3-684-0114 | m: +972 52-446-1979