Besides the distribution and parallelism that Spark brings as an
execution framework, I can't really see how phoenix-spark would be
faster than the JDBC driver :). Phoenix-spark and the JDBC driver use
the same code under the hood.
Phoenix-spark uses PhoenixOutputFormat (and thus PhoenixRecordWriter)
to write data to Phoenix; PhoenixRecordWritable is worth a look too.
These ultimately execute UPSERTs on a PreparedStatement.
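For reference, here is a minimal sketch of a DataFrame write through
phoenix-spark (the table name, columns, and ZK quorum are placeholders
for your environment). Note that phoenix-spark insists on
SaveMode.Overwrite, but under the hood every row still becomes an
UPSERT; nothing is truncated first:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("phoenix-upsert-sketch").getOrCreate()
import spark.implicits._

// Column names must match the Phoenix table's schema (ID/COL1 are examples).
val df = Seq((1L, "foo"), (2L, "bar")).toDF("ID", "COL1")

// Despite SaveMode.Overwrite, this does not truncate the table:
// PhoenixOutputFormat/PhoenixRecordWriter turn each row into an UPSERT
// on a batched PreparedStatement.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()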
There is also the CsvBulkLoadTool, which can create HFiles to bulk-load
data into Phoenix. I'm not sure if phoenix-spark has something wired up
that you can use to do this out of the box (certainly, you could do it
yourself).
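A sketch of driving it yourself, assuming a CSV already sits on HDFS
(the table name, input path, and ZK quorum are made up). The tool
implements Hadoop's Tool interface, so the usual "hadoop jar ...
org.apache.phoenix.mapreduce.CsvBulkLoadTool" invocation can also be
called programmatically:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import org.apache.phoenix.mapreduce.CsvBulkLoadTool

// Runs the MapReduce job that writes HFiles and then hands them to
// HBase for bulk load, bypassing the UPSERT/write path entirely.
val exitCode = ToolRunner.run(new Configuration(), new CsvBulkLoadTool(), Array(
  "--table", "MY_TABLE",            // target Phoenix table
  "--input", "/data/my_table.csv",  // CSV input path on HDFS
  "--zookeeper", "zk-host:2181"))   // ZK quorum of the cluster
sys.exit(exitCode)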
On 8/6/18 8:10 AM, Brandon Geise wrote:
Thanks for the reply Yun.
I’m not quite clear on how exactly this would help on the upsert side.
Are you suggesting deriving the types from Phoenix and then doing the
encoding/decoding and writing/reading directly against HBase?
Thanks,
Brandon
*From: *Jaanai Zhang <cloud.pos...@gmail.com>
*Reply-To: *<user@phoenix.apache.org>
*Date: *Sunday, August 5, 2018 at 9:34 PM
*To: *<user@phoenix.apache.org>
*Subject: *Re: Spark-Phoenix Plugin
You can get the data types from the Phoenix metadata, then encode/decode
the data yourself to write/read it directly. I think this way is
effective, FYI :)
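A rough sketch of the metadata-lookup half (the JDBC URL and table name
are placeholders; PDataType is Phoenix's type codec, and its
toBytes/toObject do the encode/decode):

import java.sql.DriverManager
import org.apache.phoenix.schema.types.PDataType

// URL and table name are placeholders for your cluster.
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
val rs = conn.getMetaData.getColumns(null, null, "MY_TABLE", null)
while (rs.next()) {
  val colName = rs.getString("COLUMN_NAME")
  val typeId  = rs.getInt("DATA_TYPE") // java.sql.Types-style type id
  // Phoenix's codec for that column: pType.toBytes(value) yields the
  // same bytes Phoenix itself would store, so you can write straight
  // to HBase, and pType.toObject(bytes) decodes on the read side.
  val pType = PDataType.fromTypeId(typeId)
  println(s"$colName -> $pType")
}
conn.close()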
----------------------------------------
Yun Zhang
Best regards!
2018-08-04 21:43 GMT+08:00 Brandon Geise <brandonge...@gmail.com>:
Good morning,
I’m looking at using a combination of HBase, Phoenix and Spark for a
project, and I read that using the Spark-Phoenix plugin directly is
more efficient than JDBC. However, it wasn’t entirely clear from the
examples whether an upsert is performed when writing a dataframe, and
how many fine-grained options there are for controlling the upsert. Any
information someone can share would be greatly appreciated!
Thanks,
Brandon