I found this in the FAQ, but I am wondering if the Spark Kudu library can be used for efficient bulk loads from HDFS into Kudu directly. By a large table, I mean tables of 5-10 billion rows.
I do not really like the options described below, because 1) I would like to bypass Impala, since the data for my bulk load comes from Sqoop and the Avro files are stored on HDFS, and 2) we do not want to deal with MapReduce. Thanks!

What's the most efficient way to bulk load data into Kudu? <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>

The easiest way to load data into Kudu is if the data is already managed by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table SELECT * FROM some_csv_table does the trick. You can also use Kudu's MapReduce OutputFormat to load data from HDFS, HBase, or any other data store that has an InputFormat. No tool is provided to load data directly into Kudu's on-disk data format. We have found that for many workloads, the insert performance of Kudu is comparable to bulk load performance of other systems.
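To make it concrete, what I have in mind is roughly the sketch below. This is just an illustration, assuming the kudu-spark and spark-avro packages are on the classpath; the HDFS path, Kudu master addresses, and table name are placeholders, and the target Kudu table would already exist with its schema and partitioning defined:

import org.apache.spark.sql.SparkSession
import org.apache.kudu.spark.kudu.KuduContext

object AvroToKudu {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-avro-to-kudu") // placeholder app name
      .getOrCreate()

    // Read the Sqoop-produced Avro files from HDFS into a DataFrame.
    // On Spark 2.4+ the "avro" format needs the spark-avro module on the classpath;
    // older versions used "com.databricks.spark.avro" instead.
    val df = spark.read
      .format("avro")
      .load("hdfs:///data/sqoop/my_table") // placeholder path

    // KuduContext writes rows from the executors directly to the tablet servers,
    // so no Impala and no MapReduce job would be involved.
    val kuduContext = new KuduContext(
      "kudu-master1:7051,kudu-master2:7051", // placeholder Kudu master addresses
      spark.sparkContext)

    // Insert the DataFrame into an existing Kudu table (placeholder table name).
    kuduContext.insertRows(df, "impala::default.my_kudu_table")

    spark.stop()
  }
}

If kudu-spark distributes the inserts across executors this way, it would seem to avoid both Impala and MapReduce, but I am not sure whether this approach is recommended for tables of this size.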
