Hi Boris.

> 1) I would like to bypass Impala, as the data for my bulk load comes from
> Sqoop and the Avro files are stored on HDFS.

What's the objection to Impala? In the example below, Impala reads from an
HDFS-resident table and writes to the Kudu table.
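For concreteness, here is a minimal sketch of that path in Impala SQL. All
names (sqoop_avro_table, my_kudu_table, the HDFS path, and the columns) are
made up for illustration, and the Kudu DDL uses the newer
PRIMARY KEY / PARTITION BY / STORED AS KUDU syntax:

  -- Expose the Sqoop-produced Avro files to Impala as an external table
  -- (schema and path are hypothetical)
  CREATE EXTERNAL TABLE sqoop_avro_table (
    id BIGINT,
    name STRING
  )
  STORED AS AVRO
  LOCATION '/user/etl/sqoop_output';

  -- Create the Kudu target table
  CREATE TABLE my_kudu_table (
    id BIGINT,
    name STRING,
    PRIMARY KEY (id)
  )
  PARTITION BY HASH (id) PARTITIONS 16
  STORED AS KUDU;

  -- The bulk load itself: a single statement, no MapReduce job involved
  INSERT INTO my_kudu_table SELECT * FROM sqoop_avro_table;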
> 2) We do not want to deal with MapReduce.

You can still use Spark. The MapReduce reference is about the
InputFormat/OutputFormat classes, which are defined in Hadoop MR; Spark can
use these. See, for example:
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma

However, you'll have to write (simple) Spark code, whereas with method #1 you
do effectively the same thing under the covers using SQL statements via
Impala. (A sketch of the Spark route follows the quoted FAQ below.)

> Thanks!
>
> What’s the most efficient way to bulk load data into Kudu?
> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>
> The easiest way to load data into Kudu is if the data is already managed
> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
> SELECT * FROM some_csv_table does the trick.
>
> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
> HBase, or any other data store that has an InputFormat.
>
> No tool is provided to load data directly into Kudu’s on-disk data format.
> We have found that for many workloads, the insert performance of Kudu is
> comparable to bulk load performance of other systems.
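If you do go the Spark route, note that rather than wiring up Kudu's MR
OutputFormat by hand, the kudu-spark integration wraps that plumbing behind a
small API, so the code stays short. A minimal sketch: the master address,
HDFS path, and table name are placeholders, the Kudu table must already
exist, and the spark-avro and kudu-spark packages must be on the classpath.

  import org.apache.spark.sql.SparkSession
  import org.apache.kudu.spark.kudu.KuduContext

  object KuduBulkLoad {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("kudu-bulk-load").getOrCreate()

      // Read the Sqoop-produced Avro files straight off HDFS
      // (hypothetical path; requires the spark-avro package)
      val df = spark.read.format("avro").load("hdfs:///user/etl/sqoop_output")

      // Insert every row into an existing Kudu table; KuduContext
      // drives the write path that the raw OutputFormat would
      val kuduContext = new KuduContext("kudu-master-host:7051", spark.sparkContext)
      kuduContext.insertRows(df, "my_kudu_table")

      spark.stop()
    }
  }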
