On Mon, Jan 29, 2018 at 1:19 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
> Thank you both. Does it make a difference from a performance perspective,
> though, if I do a bulk load through Impala versus Spark? Will the Kudu
> client with Spark be faster than Impala?
>

Recent versions of Impala pre-sort and pre-shuffle the data to avoid
compactions in Kudu during the insert. Spark does not currently have these
optimizations. So I would guess that Impala would be able to bulk load large
datasets more efficiently than Spark for the time being.

-Todd

>
> On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patr...@cloudera.com>
>> wrote:
>>
>>> Hi Boris.
>>>
>>>> 1) I would like to bypass Impala as data for my bulk load coming from
>>>> sqoop and avro files are stored on HDFS.
>>>>
>>> What's the objection to Impala? In the example below, Impala reads from
>>> an HDFS-resident table, and writes to the Kudu table.
>>>
>>>
>>>> 2) we do not want to deal with MapReduce.
>>>>
>>>
>>> You can still use Spark... the MR reference is in regards to the
>>> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
>>> these. See, for example:
>>>
>>> https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
>>>
>>
>> While that's possible, I'd recommend using the DataFrames API instead. E.g.,
>> see https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
>>
>> That should work as well as (or better than) the MR OutputFormat.
>>
>> -Todd
>>
>>
>>
>>> However, you'll have to write (simple) Spark code, whereas with method
>>> #1 you do effectively the same thing under the covers using SQL statements
>>> via Impala.
>>>
>>>
>>>>
>>>> Thanks!
>>>> What’s the most efficient way to bulk load data into Kudu?
>>>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>>>
>>>> The easiest way to load data into Kudu is if the data is already
>>>> managed by Impala. In this case, a simple INSERT INTO TABLE
>>>> some_kudu_table SELECT * FROM some_csv_table does the trick.
>>>>
>>>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>>>> HBase, or any other data store that has an InputFormat.
>>>>
>>>> No tool is provided to load data directly into Kudu’s on-disk data
>>>> format. We have found that for many workloads, the insert performance of
>>>> Kudu is comparable to bulk load performance of other systems.
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
--
Todd Lipcon
Software Engineer, Cloudera
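
For anyone who still wants the Spark route, the DataFrame approach recommended above might look roughly like the sketch below. It is written for PySpark and assumes the kudu-spark package and an Avro data source are on the classpath, and that the Kudu table already exists. The function names are made up for illustration, and the master address, table name, and HDFS path are arguments you would supply; none of them come from this thread.

```python
def kudu_write_options(kudu_master, kudu_table):
    """Build the options read by the kudu-spark data source."""
    return {"kudu.master": kudu_master, "kudu.table": kudu_table}


def load_avro_into_kudu(spark, avro_path, kudu_master, kudu_table):
    """Append Avro files from HDFS to an existing Kudu table.

    `spark` is an existing SparkSession. All other arguments are
    placeholders chosen by the caller.
    """
    # Assumption: the Avro source is registered as "avro" (Spark 2.4+).
    # Older setups that use the external spark-avro package would pass
    # "com.databricks.spark.avro" here instead.
    df = spark.read.format("avro").load(avro_path)

    # Write through the kudu-spark data source in append mode.
    (df.write
       .format("org.apache.kudu.spark.kudu")
       .options(**kudu_write_options(kudu_master, kudu_table))
       .mode("append")
       .save())
```

Per the reply above, this path does not get Impala's pre-sort and pre-shuffle step, so for a large one-off load the INSERT ... SELECT through Impala is probably still the better first thing to try.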