Hi Josh,
thanks for your reply. I'm trying to implement a bulk save to Phoenix
with Apache Spark, and the code you linked helped me a lot. I'm now
facing an issue with composite primary keys: I cannot find anywhere in
the Phoenix code where the row key is built from the individual
primary key columns. Can someone point me to the piece of code inside
Phoenix that does that?
Thank you in advance.
#A.M.
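To make the composite-key question concrete, here is a minimal sketch of how a row key for PRIMARY KEY (name VARCHAR, id INTEGER, city VARCHAR) is laid out, assuming ascending sort order, no salting, and no multi-tenancy: the encoded key columns are concatenated, a zero byte terminates each variable-length column that is not the last one, and fixed-width columns get no separator. The class and method names (RowKeySketch, encodeInt, encodeVarchar, rowKey) are hypothetical and are not Phoenix APIs. A plausible place to look in the Phoenix source is `PTableImpl.newKey`; treat that pointer as an assumption to verify against your Phoenix version.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not part of Phoenix.
public class RowKeySketch {
    // Zero byte that terminates a variable-length key column.
    static final byte SEPARATOR = 0;

    // Fixed-width 4-byte big-endian encoding with the sign bit flipped,
    // so that unsigned byte order matches signed integer order.
    static byte[] encodeInt(int v) {
        int flipped = v ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8), (byte) flipped
        };
    }

    // VARCHAR values are stored as their UTF-8 bytes.
    static byte[] encodeVarchar(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Row key for PRIMARY KEY (name VARCHAR, id INTEGER, city VARCHAR).
    static byte[] rowKey(String name, int id, String city) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] n = encodeVarchar(name);
        out.write(n, 0, n.length);
        out.write(SEPARATOR);          // variable-length and not last: terminate
        byte[] i = encodeInt(id);
        out.write(i, 0, i.length);     // fixed width: no separator needed
        byte[] c = encodeVarchar(city);
        out.write(c, 0, c.length);     // last column: no trailing separator
        return out.toByteArray();
    }
}
```

For real data it is safer to let Phoenix's own type classes do the encoding rather than hand-rolling it, since descending columns, salted tables and other types follow additional rules that this sketch ignores.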
On 09/28/2016 05:10 PM, Josh Mahonin wrote:
Hi Antonio,
You're correct, the phoenix-spark output uses the Phoenix Hadoop
OutputFormat under the hood, which effectively does a parallel, batch
JDBC upsert. It should scale depending on the number of Spark
executors, RDD/DataFrame parallelism, and number of HBase
RegionServers, though admittedly there's a lot of overhead involved.
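As a rough illustration of what "parallel, batch JDBC upsert" means for a single writer, the sketch below issues one UPSERT per row over a plain JDBC connection and commits every batchSize rows. It is not the actual PhoenixRecordWriter code; the class name, method names and parameters (BatchUpsertSketch, upsertSql, upsertAll, batchSize) are hypothetical, and it uses only standard java.sql calls.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Illustration only: hypothetical helper, not part of Phoenix.
public class BatchUpsertSketch {
    // Builds e.g. "UPSERT INTO T (ID, NAME) VALUES (?, ?)".
    static String upsertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "UPSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Executes one upsert per row and commits every batchSize rows.
    // Returns the number of rows written.
    static long upsertAll(Connection conn, String sql,
                          Iterable<Object[]> rows, int batchSize) throws SQLException {
        conn.setAutoCommit(false);     // commit explicitly, once per batch
        long written = 0;
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    stmt.setObject(i + 1, row[i]);   // JDBC parameters are 1-based
                }
                stmt.executeUpdate();
                written++;
                if (written % batchSize == 0) {
                    conn.commit();     // flush the current batch
                }
            }
            conn.commit();             // flush any remaining rows
        }
        return written;
    }
}
```

In the Spark case, each partition would run a loop like this on its own connection, which is where the parallelism comes from.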
The CSV bulk loading tool uses MapReduce; it's not integrated with
Spark. It's likely possible to do so, but it's probably a non-trivial
amount of work. If you're interested in taking it on, I'd start by
looking at the following classes:
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
Good luck,
Josh
On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia
<antonio.mur...@eng.it> wrote:
Hi,
I would like to perform a bulk insert into HBase using Apache Phoenix from
Spark. I tried using the Apache Spark Phoenix library but, as far as I was
able to understand from the code, it looks like it performs a JDBC batch
of upserts (am I right?). Instead, I want to perform a bulk load like the
one described in this blog post
(https://zeyuanxy.github.io/HBase-Bulk-Loading/), while taking advantage of
the automatic conversion from Java/Scala types to bytes.
I'm currently using Phoenix 4.5.2, so I cannot use Hive to
manipulate the Phoenix table, and if possible I want to avoid
spawning an MR job that reads data from CSV
(https://phoenix.apache.org/bulk_dataload.html). Essentially, I just want to
do what the CSV loader does with MR, but programmatically with Spark
(since the data I want to persist is already loaded in memory).
Thank you all!