Re: Bulk / Initial load of large tables into Kudu using Spark

Boris Tyukin Mon, 29 Jan 2018 13:21:26 -0800

Thank you both. Does it make a difference from a performance perspective,
though, whether I do the bulk load through Impala or through Spark? Will the
Kudu client with Spark be faster than Impala?


On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patr...@cloudera.com>
> wrote:
>
>> Hi Boris.
>>
>>> 1) I would like to bypass Impala, as the data for my bulk load comes from
>>> Sqoop and the Avro files are stored on HDFS.
>>>
>> What's the objection to Impala? In the example below, Impala reads from
>> an HDFS-resident table, and writes to the Kudu table.
>>
>>
>>> 2) we do not want to deal with MapReduce.
>>>
>>
>> You can still use Spark... the MR reference is in regards to the
>> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
>> these. See, for example:
>>
>> https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
>>
>
> While that's possible, I'd recommend using the DataFrames API instead; e.g.,
> see https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
>
> That should work as well as (or better than) the MR OutputFormat.
>
> -Todd
>
>
>
>> However, you'll have to write (simple) Spark code, whereas with method #1
>> you do effectively the same thing under the covers using SQL statements via
>> Impala.
>>
>>
>>>
>>> Thanks!
>>> What’s the most efficient way to bulk load data into Kudu?
>>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>>
>>> The easiest way to load data into Kudu is if the data is already managed
>>> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
>>> SELECT * FROM some_csv_table does the trick.
>>>
>>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>>> HBase, or any other data store that has an InputFormat.
>>>
>>> No tool is provided to load data directly into Kudu’s on-disk data
>>> format. We have found that for many workloads, the insert performance of
>>> Kudu is comparable to bulk load performance of other systems.
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
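
For reference, a minimal PySpark sketch of the DataFrame route Todd recommends above. The data source name and the `kudu.master` / `kudu.table` options follow the Kudu Spark integration docs linked in the thread. Everything else is an assumption for illustration: the helper names, the HDFS path, the master address, the table name, and that the kudu-spark and spark-avro packages are on the classpath. It also assumes the target Kudu table already exists. It says nothing about whether this is faster than Impala.

```python
# Sketch only: the format name and option keys come from the kudu-spark docs;
# helper names, paths, hosts and table names are hypothetical placeholders.

KUDU_FORMAT = "org.apache.kudu.spark.kudu"


def kudu_options(master, table):
    """Build the two options the kudu-spark data source needs."""
    return {"kudu.master": master, "kudu.table": table}


def bulk_load_avro(spark, avro_path, master, table):
    """Read Avro files from HDFS and append their rows to an existing Kudu table."""
    # Assumption: the spark-avro package is available, so "avro" resolves.
    df = spark.read.format("avro").load(avro_path)

    # Append rows through the Kudu data source; the table is not created here.
    (df.write
       .format(KUDU_FORMAT)
       .options(**kudu_options(master, table))
       .mode("append")
       .save())
    return df
```

With a live session this might be called as `bulk_load_avro(spark, "hdfs:///data/sqoop/my_table", "kudu-master:7051", "my_table")`, where all three string arguments are placeholders.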
