On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <[email protected]> wrote:
> Hi Boris.
>
>> 1) I would like to bypass Impala as data for my bulk load is coming from
>> sqoop and avro files are stored on HDFS.
>>
> What's the objection to Impala? In the example below, Impala reads from an
> HDFS-resident table, and writes to the Kudu table.
>
>> 2) we do not want to deal with MapReduce.
>>
> You can still use Spark... the MR reference is in regards to the
> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
> these. See, for example:
>
> https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma

While that's possible, I'd recommend using the DataFrames API instead, e.g. see
https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark

That should work as well as (or better than) the MR OutputFormat.

-Todd

> However, you'll have to write (simple) Spark code, whereas with method #1
> you do effectively the same thing under the covers using SQL statements via
> Impala.
>
>> Thanks!
>>
>> What’s the most efficient way to bulk load data into Kudu?
>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>
>> The easiest way to load data into Kudu is if the data is already managed
>> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
>> SELECT * FROM some_csv_table does the trick.
>>
>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>> HBase, or any other data store that has an InputFormat.
>>
>> No tool is provided to load data directly into Kudu’s on-disk data
>> format. We have found that for many workloads, the insert performance of
>> Kudu is comparable to bulk load performance of other systems.

--
Todd Lipcon
Software Engineer, Cloudera
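For reference, a minimal sketch of the DataFrames approach Todd recommends,
using the kudu-spark KuduContext. The master addresses, HDFS path, and table
name below are placeholders, and the target Kudu table is assumed to already
exist with a schema matching the DataFrame:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-to-kudu-bulk-load")
  .getOrCreate()

// Read the sqoop-produced Avro files from HDFS into a DataFrame.
// (The "avro" format is built in as of Spark 2.4; older versions need
// the com.databricks.spark.avro package on the classpath.)
val df = spark.read.format("avro").load("hdfs:///data/sqoop/my_table")

// KuduContext performs the actual writes against the Kudu cluster.
val kuduContext = new KuduContext(
  "kudu-master1:7051,kudu-master2:7051", spark.sparkContext)

// Insert the rows into the existing Kudu table. Impala-managed Kudu
// tables are addressed with the "impala::" prefix shown here.
kuduContext.insertRows(df, "impala::default.my_kudu_table")

kuduContext also offers upsertRows for loads that may contain duplicate
keys, which avoids failing the job on primary-key collisions.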
