Hi! I can't find any generic recommendations on how to choose the number of buckets for single-level hash partitioning.
All that I found:

* "For large tables, prefer to use roughly 10 partitions per server in the cluster": https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html#kudu_partitioning__kudu_hash_partitioning. BTW, why 10? It looks like a magic number to me :).
* Some recommendations here: https://kudu.apache.org/docs/known_issues.html#_scale

My use case: accumulate up to 500 GB - 1 TB of data per day and run some aggregations with Spark over that data at the end of the day.

What should the number of buckets depend on? The number of servers, the number of disks (I use HDDs without any RAID), the number of CPU cores? Any suggestions?

--
with best regards, Pavel Martynov
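
P.S. For concreteness, here is a rough sketch of the kind of DDL I mean. The table name, columns, and hash column are made up for illustration; the 30 assumes 3 tablet servers times the "roughly 10 partitions per server" rule of thumb from the Impala/Kudu docs:

CREATE TABLE daily_events (
  event_id BIGINT,
  event_time BIGINT,
  payload STRING,
  PRIMARY KEY (event_id)
)
-- 30 buckets = 3 tablet servers x ~10 partitions per server (rule of thumb, not verified)
PARTITION BY HASH (event_id) PARTITIONS 30
STORED AS KUDU;

Is a number derived this way reasonable for my workload, or should it be driven by disks or CPU cores instead?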
