Thank you very much for your insights, Josh. If I decide to develop a
small Phoenix library that does, through Spark, what the CSV loader
does, I'll surely write to the mailing list, open a JIRA, or maybe
even open a PR, right?
Thank you again
#A.M.
On 09/28/2016 05:10 PM, Josh Mahonin wrote:
Hi Antonio,
You're correct, the phoenix-spark output uses the Phoenix Hadoop
OutputFormat under the hood, which effectively does a parallel, batch
JDBC upsert. It should scale depending on the number of Spark
executors, RDD/DataFrame parallelism, and number of HBase
RegionServers, though admittedly there's a lot of overhead involved.
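For reference, that's the code path you hit with a DataFrame save along
these lines (just a sketch; the table name, columns and ZooKeeper quorum
are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._  // adds saveToPhoenix() to DataFrames

val sc = new SparkContext(new SparkConf().setAppName("phoenix-upsert"))
val sqlContext = new SQLContext(sc)

// Any DataFrame whose column names match the target Phoenix table
val df = sqlContext.createDataFrame(Seq((1L, "foo"), (2L, "bar"))).toDF("ID", "NAME")

// Goes through PhoenixOutputFormat: each Spark partition issues batched
// JDBC upserts against the cluster
df.saveToPhoenix("OUTPUT_TABLE", zkUrl = Some("zkhost:2181"))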
The CSV bulk loading tool uses MapReduce; it's not integrated with
Spark. Doing so is likely possible, but it's probably a non-trivial
amount of work. If you're interested in taking it on, I'd start by
looking at the following classes (a bare HBase-level sketch of the bulk
load flow follows the list):
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
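To give a feel for the moving parts, a bare HBase-level skeleton of such a
Spark job (generate sorted KeyValues, write HFiles with HFileOutputFormat2,
hand them to LoadIncrementalHFiles) might look like the sketch below. It
assumes an HBase 1.x client; the table name, paths and values are
placeholders, and it encodes everything with plain Bytes.toBytes into
Phoenix's default '0' column family. Producing row keys and values that
match Phoenix's own encoding (composite keys, salting, indexes, etc.) is
exactly the non-trivial part the classes above take care of:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-bulkload-sketch"))
val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val tableName = TableName.valueOf("MY_TABLE")

// Configure the job with the table's region boundaries and compression
val job = Job.getInstance(conf)
HFileOutputFormat2.configureIncrementalLoad(job,
  conn.getTable(tableName), conn.getRegionLocator(tableName))

// (rowKey, qualifier, value) triples; sort first so HFiles are written in key order
val kvs = sc.parallelize(Seq(("row1", "NAME", "foo"), ("row2", "NAME", "bar")))
  .sortBy { case (rk, cq, _) => (rk, cq) }
  .map { case (rk, cq, v) =>
    (new ImmutableBytesWritable(Bytes.toBytes(rk)),
     new KeyValue(Bytes.toBytes(rk), Bytes.toBytes("0"), Bytes.toBytes(cq), Bytes.toBytes(v)))
  }

// Write HFiles to HDFS, then hand them over to the region servers
kvs.saveAsNewAPIHadoopFile("/tmp/hfiles-out",
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat2], job.getConfiguration)

new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles-out"),
  conn.getAdmin, conn.getTable(tableName), conn.getRegionLocator(tableName))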
Good luck,
Josh
On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia
<antonio.mur...@eng.it> wrote:
Hi,
I would like to perform a bulk insert to HBase using Apache Phoenix from
Spark. I tried using the Apache Phoenix Spark library but, as far as I was
able to understand from the code, it looks like it performs a JDBC batch
of upserts (am I right?). Instead, I want to perform a bulk load like the
one described in this blog post
(https://zeyuanxy.github.io/HBase-Bulk-Loading/), but taking advantage of
the automatic transformation between Java/Scala types and Bytes.
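To be concrete about the type handling I'd like to reuse rather than
re-implement, here is a toy example using Phoenix's own PDataType classes
(the values are made up):

import org.apache.phoenix.schema.types.{PLong, PVarchar}

// Phoenix already knows how to turn JVM values into its own byte encoding
val idBytes: Array[Byte] = PLong.INSTANCE.toBytes(42L)
val nameBytes: Array[Byte] = PVarchar.INSTANCE.toBytes("foo")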
I'm currently using Phoenix 4.5.2, therefore I cannot use Hive to
manipulate the Phoenix table, and, if possible, I want to avoid spawning
an MR job that reads data from CSV
(https://phoenix.apache.org/bulk_dataload.html). Essentially, I just want
to do what the CSV loader does with MapReduce, but programmatically with
Spark (since the data I want to persist is already loaded in memory).
Thank you all!