I found this in the FAQ, but I am wondering if the Spark Kudu library can be used for efficient bulk loads from HDFS into Kudu directly. By a large table, I mean tables of 5-10 billion rows.
I do not really like the options described below, because 1) I would like to bypass Impala, since the data for my bulk load comes from Sqoop and the Avro files are stored on HDFS, and 2) we do not want to deal with MapReduce. Thanks!

What's the most efficient way to bulk load data into Kudu? <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>

The easiest way to load data into Kudu is if the data is already managed by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table SELECT * FROM some_csv_table does the trick. You can also use Kudu's MapReduce OutputFormat to load data from HDFS, HBase, or any other data store that has an InputFormat. No tool is provided to load data directly into Kudu's on-disk data format. We have found that for many workloads, the insert performance of Kudu is comparable to bulk load performance of other systems.
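To make it concrete, what I have in mind is roughly the sketch below. This is just an illustration, assuming the kudu-spark and spark-avro packages are on the classpath; the HDFS path, Kudu master addresses, and table name are placeholders, and the target Kudu table would already exist with its schema and partitioning defined:

import org.apache.spark.sql.SparkSession
import org.apache.kudu.spark.kudu.KuduContext

object AvroToKudu {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-avro-to-kudu") // placeholder app name
      .getOrCreate()

    // Read the Sqoop-produced Avro files from HDFS into a DataFrame.
    // On Spark 2.4+ the "avro" format needs the spark-avro module on the classpath;
    // older versions used "com.databricks.spark.avro" instead.
    val df = spark.read
      .format("avro")
      .load("hdfs:///data/sqoop/my_table") // placeholder path

    // KuduContext writes rows from the executors directly to the tablet servers,
    // so no Impala and no MapReduce job would be involved.
    val kuduContext = new KuduContext(
      "kudu-master1:7051,kudu-master2:7051", // placeholder Kudu master addresses
      spark.sparkContext)

    // Insert the DataFrame into an existing Kudu table (placeholder table name).
    kuduContext.insertRows(df, "impala::default.my_kudu_table")

    spark.stop()
  }
}

If kudu-spark distributes the inserts across executors this way, it would seem to avoid both Impala and MapReduce, but I am not sure whether this approach is recommended for tables of this size.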
