Thank you both. Does it make a difference from a performance perspective, though, whether I do the bulk load through Impala or through Spark? Will the Kudu client with Spark be faster than Impala?

On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patr...@cloudera.com>
> wrote:
>
>> Hi Boris.
>>
>>> 1) I would like to bypass Impala as data for my bulk load coming from
>>> sqoop and avro files are stored on HDFS.
>>
>> What's the objection to Impala? In the example below, Impala reads from
>> an HDFS-resident table, and writes to the Kudu table.
>>
>>> 2) we do not want to deal with MapReduce.
>>
>> You can still use Spark... the MR reference is in regards to the
>> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
>> these. See, for example:
>>
>> https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
>
> While that's possible I'd recommend using the dataframes API instead. eg
> see https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
>
> That should work as well (or better) than the MR outputformat.
>
> -Todd
>
>> However, you'll have to write (simple) Spark code, whereas with method #1
>> you do effectively the same thing under the covers using SQL statements via
>> Impala.
>>
>>> Thanks!
>>>
>>> What’s the most efficient way to bulk load data into Kudu?
>>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>>
>>> The easiest way to load data into Kudu is if the data is already managed
>>> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
>>> SELECT * FROM some_csv_table does the trick.
>>>
>>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>>> HBase, or any other data store that has an InputFormat.
>>>
>>> No tool is provided to load data directly into Kudu’s on-disk data
>>> format. We have found that for many workloads, the insert performance of
>>> Kudu is comparable to bulk load performance of other systems.
>
> --
> Todd Lipcon
> Software Engineer, Cloudera