On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <[email protected]> wrote:
> Hi Boris.
>
>> 1) I would like to bypass Impala as data for my bulk load is coming from
>> sqoop and avro files are stored on HDFS.
>>
> What's the objection to Impala? In the example below, Impala reads from an
> HDFS-resident table, and writes to the Kudu table.
>
>> 2) we do not want to deal with MapReduce.
>>
> You can still use Spark... the MR reference is in regards to the
> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
> these. See, for example:
>
> https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma

While that's possible, I'd recommend using the DataFrames API instead, e.g. see
https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark

That should work as well as (or better than) the MR OutputFormat.

-Todd

> However, you'll have to write (simple) Spark code, whereas with method #1
> you do effectively the same thing under the covers using SQL statements via
> Impala.
>
>> Thanks!
>>
>> What’s the most efficient way to bulk load data into Kudu?
>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>
>> The easiest way to load data into Kudu is if the data is already managed
>> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
>> SELECT * FROM some_csv_table does the trick.
>>
>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>> HBase, or any other data store that has an InputFormat.
>>
>> No tool is provided to load data directly into Kudu’s on-disk data
>> format. We have found that for many workloads, the insert performance of
>> Kudu is comparable to bulk load performance of other systems.

--
Todd Lipcon
Software Engineer, Cloudera
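For reference, a minimal sketch of the DataFrames approach Todd recommends,
using the kudu-spark KuduContext. The master addresses, HDFS path, and table
name below are placeholders, and the target Kudu table is assumed to already
exist with a schema matching the DataFrame:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-to-kudu-bulk-load")
  .getOrCreate()

// Read the sqoop-produced Avro files from HDFS into a DataFrame.
// (The "avro" format is built in as of Spark 2.4; older versions need
// the com.databricks.spark.avro package on the classpath.)
val df = spark.read.format("avro").load("hdfs:///data/sqoop/my_table")

// KuduContext performs the actual writes against the Kudu cluster.
val kuduContext = new KuduContext(
  "kudu-master1:7051,kudu-master2:7051", spark.sparkContext)

// Insert the rows into the existing Kudu table. Impala-managed Kudu
// tables are addressed with the "impala::" prefix shown here.
kuduContext.insertRows(df, "impala::default.my_kudu_table")

kuduContext also offers upsertRows for loads that may contain duplicate
keys, which avoids failing the job on primary-key collisions.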
