Hi Chetan,

Thank you for taking a look at Kudu! Apache Kudu is designed to perform well for OLAP workloads.
You can scale a Kudu cluster horizontally pretty well, at least up to a few hundred nodes. Here you can find more information on recommended data-per-node sizes, scaling limitations, and more: https://kudu.apache.org/docs/known_issues.html#_scale

In the past, scans could become slower if the ingestion of the data followed the 'trickling inserts' pattern (see https://issues.apache.org/jira/browse/KUDU-1400), but that has been addressed, and newer versions (1.10 and newer) don't have the issue.

There isn't a limit on how many large tables you can host in a Kudu cluster, assuming you partition those large tables appropriately (see https://kudu.apache.org/docs/schema_design.html#schema_design) and scale the cluster as needed, especially if some of those tables contain 'cold' data.

Random reads and updates are supported regardless of the scale of a Kudu cluster. Moreover, starting with Kudu 1.15 there is support for multi-row transactions, marked as an experimental feature and supporting INSERT/INSERT_IGNORE operations only at this point. It targets the 'bulk ingest' use case rather than OLTP patterns with many small transactions, though.

The important points for running many parallel workloads against a single Kudu cluster are: (a) choose the table schema properly, (b) partition tables accordingly, (c) use multiple data directories backed by separate HDD/SSD devices per node, (d) use SSD or NVMe devices for the WAL, and (e) allocate enough memory for the block cache.

I'd recommend building a POC to get some real numbers, because workloads vary and it's hard to provide exact numbers without knowing many of the details.
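To help plan such a POC, a rough back-of-envelope storage estimate is a reasonable starting point. The sketch below is a minimal capacity calculation, not an official sizing formula: the 50 bytes per row post-compression is a placeholder assumption you'd replace with measurements from real data, and the 8 TB ceiling reflects the recommended maximum stored data per tablet server from the known-issues page. It only covers disk capacity; scan CPU and write throughput often dominate in practice, which is exactly why a POC is worthwhile.

```python
import math

def estimate_min_nodes(rows_per_day, row_bytes, ttl_days,
                       replication=3, max_tb_per_node=8.0):
    """Rough minimum node count from storage capacity alone.

    row_bytes is the per-row size on disk AFTER Kudu's columnar
    encoding and compression -- an assumption here; measure it on
    real data during the POC.
    """
    total_bytes = rows_per_day * row_bytes * ttl_days * replication
    total_tb = total_bytes / 1e12
    return total_tb, math.ceil(total_tb / max_tb_per_node)

# Numbers from the use case: ~2 billion unique rows/day, 30-day TTL.
# 50 bytes/row post-compression is a placeholder assumption.
total_tb, nodes = estimate_min_nodes(rows_per_day=2_000_000_000,
                                     row_bytes=50, ttl_days=30)
print(total_tb, nodes)  # 9.0 TB replicated, so 2 nodes at minimum
```

Under those assumptions the dataset is only about 9 TB replicated, so capacity alone would be satisfied by a very small cluster, and the node count would instead be driven by ingest rate, scan concurrency, and the per-server tablet-count guidance from the scaling docs.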
As for related articles/blogs about using Kudu, I can recommend taking a look at the following relatively recent posts:
https://boristyukin.com/building-near-real-time-big-data-lake-part-i/
https://boristyukin.com/building-near-real-time-big-data-lake-part-2/

Perhaps other people could chime in to provide more insights based on their own experience running Kudu with their workloads.

Kind regards,
Alexey

On Tue, Sep 21, 2021 at 11:29 PM Chetan Rautela <[email protected]> wrote:
> Hi team,
>
> I am looking for a storage solution that can give high ingestion/update
> rates and is able to run OLAP queries. Apache Kudu looks like one promising
> solution. Please help me check whether Apache Kudu is a correct fit.
>
> Use Case:
> ----------
> I am receiving 40K records per sec. The record size is small, 5 fields
> max: 2 strings, 2 timestamps, 1 number.
> By primary key, I will be getting ~2 billion unique records per
> day, and the rest will be updates.
> With Apache Spark aggregation we can reduce updates by 20%.
> The TTL of each record will be 30 days.
>
> How much data can we store in Kudu per node?
> With large update volumes, will get/scan requests become slow over time?
> How many large tables can we create in Kudu?
> Will random reads and updates be supported at this scale?
> How many parallel ingestion jobs can we run in a Kudu cluster, for different tables?
>
> Please suggest some articles related to Kudu sizing and performance.
>
> Regards,
> Chetan Rautela
