Besides the parallelism that Spark provides as a distributed execution framework, I can't really see how phoenix-spark would be faster than the JDBC driver :). Phoenix-spark and the JDBC driver use the same code under the hood.

Phoenix-spark uses PhoenixOutputFormat (and thus PhoenixRecordWriter) to write data to Phoenix; PhoenixRecordWritable is also worth a look. These classes ultimately execute UPSERTs on a PreparedStatement.
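
For reference, a minimal sketch of what that write path looks like from Spark (the table name, ZooKeeper quorum, and input source below are placeholders, not anything from your setup):

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder().appName("phoenix-write-example").getOrCreate()

  // Hypothetical DataFrame whose columns match the target Phoenix table.
  val df = spark.read.parquet("/tmp/input")

  // This goes through PhoenixOutputFormat/PhoenixRecordWriter, which batch
  // UPSERT VALUES statements on a PreparedStatement and commit them.
  df.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)             // phoenix-spark writes are UPSERTs
    .option("table", "EXAMPLE_TABLE")     // placeholder Phoenix table
    .option("zkUrl", "zkhost:2181")       // placeholder ZooKeeper quorum
    .save()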

There is also the CsvBulkLoadTool, which creates HFiles to bulk-load data into Phoenix. I'm not sure whether phoenix-spark has something wired up to do this out of the box (certainly, you could do it yourself).
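
The bulk load tool is a regular Hadoop Tool, so you can launch it from code as well as from the command line. A rough sketch, with placeholder table name and input path:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.util.ToolRunner
  import org.apache.phoenix.mapreduce.CsvBulkLoadTool

  // Builds HFiles from the CSV input and hands them to HBase for bulk load,
  // bypassing the UPSERT/PreparedStatement path entirely.
  val exitCode = ToolRunner.run(new Configuration(),
    new CsvBulkLoadTool(),
    Array(
      "--table", "EXAMPLE_TABLE",         // placeholder Phoenix table
      "--input", "/data/example.csv"))    // placeholder HDFS path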

On 8/6/18 8:10 AM, Brandon Geise wrote:
Thanks for the reply Yun.

I’m not quite clear on how exactly this would help on the upsert side. Are you suggesting deriving the types from Phoenix and then doing the encoding/decoding and writing/reading directly against HBase?

Thanks,

Brandon

*From: *Jaanai Zhang <cloud.pos...@gmail.com>
*Reply-To: *<user@phoenix.apache.org>
*Date: *Sunday, August 5, 2018 at 9:34 PM
*To: *<user@phoenix.apache.org>
*Subject: *Re: Spark-Phoenix Plugin

You can get the data types from Phoenix metadata, then encode/decode the data yourself to write/read it directly. I think this approach is effective, FYI :)
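
For illustration, a minimal sketch of pulling column types from Phoenix metadata over JDBC (the connection URL and table name are placeholders):

  import java.sql.DriverManager

  // Placeholder connection string; point it at your ZooKeeper quorum.
  val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")

  // Column names and java.sql.Types codes for a table, which you could then
  // use to encode/decode HBase cell values yourself.
  val rs = conn.getMetaData.getColumns(null, null, "EXAMPLE_TABLE", null)
  while (rs.next()) {
    println(rs.getString("COLUMN_NAME") + " -> " + rs.getInt("DATA_TYPE"))
  }
  conn.close()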


----------------------------------------

    Yun Zhang

    Best regards!

2018-08-04 21:43 GMT+08:00 Brandon Geise <brandonge...@gmail.com>:

    Good morning,

    I’m looking at using a combination of HBase, Phoenix and Spark for a
    project, and I read that using the Spark-Phoenix plugin directly is
    more efficient than JDBC. However, it wasn’t entirely clear from the
    examples whether an upsert is performed when writing a dataframe, and
    how fine-grained the options are for executing the upsert.  Any
    information someone can share would be greatly appreciated!

    Thanks,

    Brandon
