Hi Sunil,

Sorry for the delayed response. Let me preface this by saying I'm not an Impala or HDFS expert.

Sharing resources: The "con" is that each system (Kudu, HDFS, Impala) is bound to use resources that the others could use. For example, HDFS could fill up space on a disk that Kudu is using, and Kudu would then have to use a different disk (if it were configured to use multiple disks). The same goes for memory, cores, etc., although Kudu has its own ways of dealing with memory pressure, full disks, etc. The "pro" is that you could have fewer machines.

SSD vs spinning disks: In terms of provisioning for Kudu, I would say that, given the option, your WAL directory should be on an SSD. Every insert, upsert, etc. is written to the WAL on disk, so it is important that this disk performs well.

Distributing data: Disk partitioning isn't particularly relevant to how Kudu distributes data to tservers. Kudu distributes tablets (i.e. chunks of a table, which may be defined by hash or range partitions) based on your partitioning schema <https://kudu.apache.org/docs/schema_design.html> and replication factor. If your table only has a single tablet and a replication factor of 1, there will be a single chunk of data for that table in a single location. If your schema specifies multiple tablets for your table, there will be multiple chunks of data for that table, each in a single location (although the chunks may be in different locations from one another). If you have a replication factor >1, there will be multiple copies of these chunks.

Hope this helped,
Andrew

On Tue, Nov 21, 2017 at 4:17 PM, Sunil Parmar <[email protected]> wrote:

> We are using CDH 5.12 and using HDFS for our primary data storage and
> Impala for querying them. Our worker node hosts both HDFS datanode and
> Impalad services. We're starting to move some of our data into KUDU and
> would like to understand community experiment and recommendation on
> disk/machine allocation and pro/cons for each.
>
> Install KUDU tablet server on each worker node vs separate machine
> Separate physical disks for KUDU tablet server on same machine vs sharing
> the disk with data nodes
> SSD vs spinning disks
>
> Some more questions on separate note but kinda related to the POC
> We have a small table as a first candidate for KUDU ( couple of G before
> replication ) . Does KUDU tries to distribute data across tablet servers
> for each table i.e. slow performance with too much sparse data. i.e. for
> small table what is better fewer disk partitions ( host-partition ) vs
> evenly distributed across worker nodes.
>
> Thanks,
> Sunil Parmar
>

--
Andrew Wong
