Re: bulk-upsert spark phoenix

2016-10-17 Thread Antonio Murgia

Hi Josh,

thanks for your reply. I'm trying to implement a bulk save to Phoenix 
with Apache Spark, and the code you linked helped me a lot. I'm now 
facing an issue with composite primary keys: I cannot find where in 
the Phoenix code the row key is built from the individual primary 
key columns. Can someone point me to the piece of code inside Phoenix 
that does that?
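
For reference, one way to see the row key Phoenix builds for a composite
primary key, without committing anything, is to run a local UPSERT and
inspect the uncommitted KeyValues (the actual byte assembly happens inside
Phoenix's UPSERT path, around PTable#newKey, if memory serves). A minimal
sketch in Scala, assuming Phoenix 4.x's PhoenixRuntime API; the table,
columns, and ZooKeeper quorum below are made up:

    import java.sql.DriverManager
    import scala.collection.JavaConverters._
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.phoenix.util.PhoenixRuntime

    // Keep the mutation local to the connection: no auto-commit.
    val conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-host:2181")
    conn.setAutoCommit(false)

    // Hypothetical table with a composite primary key (K1, K2).
    val stmt = conn.prepareStatement(
      "UPSERT INTO MY_TABLE (K1, K2, VAL) VALUES (?, ?, ?)")
    stmt.setString(1, "a")
    stmt.setInt(2, 42)
    stmt.setString(3, "some value")
    stmt.executeUpdate()

    // Each KeyValue's row is the fully encoded composite row key that
    // Phoenix would write to HBase for this UPSERT.
    val iter = PhoenixRuntime.getUncommittedDataIterator(conn, true)
    iter.asScala.foreach { tableAndKvs =>
      tableAndKvs.getSecond.asScala.foreach { kv =>
        println(Bytes.toStringBinary(kv.getRow))
      }
    }

    conn.rollback()  // discard the local mutation
    conn.close()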

Thank you in advance.

#A.M.


On 09/28/2016 05:10 PM, Josh Mahonin wrote:

Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop 
OutputFormat under the hood, which effectively does a parallel, batch 
JDBC upsert. It should scale depending on the number of Spark 
executors, RDD/DataFrame parallelism, and number of HBase 
RegionServers, though admittedly there's a lot of overhead involved.
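
For anyone following along, the save path described here looks roughly like
this from the caller's side. A minimal sketch against the phoenix-spark API
of that era, with sc assumed to be an existing SparkContext and the table
name and zkUrl made up:

    import org.apache.spark.sql.SQLContext
    import org.apache.phoenix.spark._

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.createDataFrame(Seq((1L, "foo"), (2L, "bar")))
      .toDF("ID", "COL1")

    // Under the hood this goes through PhoenixOutputFormat, i.e. a
    // batched JDBC UPSERT issued from each Spark partition.
    df.saveToPhoenix("OUTPUT_TABLE", zkUrl = Some("zookeeper-host:2181"))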


The CSV bulk loading tool uses MapReduce; it's not integrated with 
Spark. It's likely possible to do so, but it's probably a non-trivial 
amount of work. If you're interested in taking it on, I'd start with 
looking at the following classes:


https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
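
Building on those classes, a very rough sketch of what a Spark port of that
pipeline could look like: reuse Phoenix to encode rows into KeyValues (the
same trick CsvToKeyValueMapper relies on), write HFiles with
HFileOutputFormat2, then hand the output directory to LoadIncrementalHFiles
(completebulkload). Everything below is a hedged sketch: table, columns,
paths, and the ZooKeeper quorum are made up, and region-boundary
partitioning and serializer configuration for the HBase writables are
glossed over.

    import java.sql.DriverManager
    import scala.collection.JavaConverters._
    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.phoenix.util.PhoenixRuntime
    import org.apache.spark.rdd.RDD

    // rowsRdd is whatever data is already in memory, e.g. RDD[(String, Int, String)].
    def toKeyValues(rowsRdd: RDD[(String, Int, String)]) =
      rowsRdd.mapPartitions { rows =>
        // Run local UPSERTs and harvest the KeyValues Phoenix would have
        // written, instead of committing them over JDBC.
        val conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-host:2181")
        conn.setAutoCommit(false)
        val stmt = conn.prepareStatement(
          "UPSERT INTO MY_TABLE (K1, K2, VAL) VALUES (?, ?, ?)")
        rows.foreach { case (k1, k2, v) =>
          stmt.setString(1, k1); stmt.setInt(2, k2); stmt.setString(3, v)
          stmt.executeUpdate()
        }
        val kvs = PhoenixRuntime.getUncommittedDataIterator(conn, true).asScala
          .flatMap(_.getSecond.asScala)
          .map(kv => (new ImmutableBytesWritable(kv.getRow), kv))
          .toList
        conn.rollback()
        conn.close()
        kvs.iterator
      }

    // HFiles must be written in sorted key order; a real job would also
    // mirror HFileOutputFormat2.configureIncrementalLoad so the files line
    // up with region boundaries before running LoadIncrementalHFiles.
    def writeHFiles(kvs: RDD[(ImmutableBytesWritable, KeyValue)]): Unit =
      kvs.sortByKey().saveAsNewAPIHadoopFile(
        "/tmp/phoenix-hfiles",
        classOf[ImmutableBytesWritable],
        classOf[KeyValue],
        classOf[HFileOutputFormat2],
        HBaseConfiguration.create())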

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia wrote:


Hi,

I would like to perform a bulk insert to HBase using Apache Phoenix from
Spark. I tried using the Apache Spark Phoenix library but, as far as I was
able to understand from the code, it looks like it performs a JDBC batch
of upserts (am I right?). Instead I want to perform a bulk load like the
one described in this blog post
(https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advantage of
the automatic conversion of Java/Scala types to bytes.
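
(For reference, the type-to-bytes conversion mentioned here is Phoenix's
PDataType machinery; a tiny, hedged sketch, assuming the Phoenix 4.x
package layout:)

    import org.apache.phoenix.schema.types.{PInteger, PVarchar}

    // Phoenix's PDataType implementations perform the Java/Scala-to-bytes
    // encoding used for primary key parts and column values.
    val keyPart1: Array[Byte] = PVarchar.INSTANCE.toBytes("a")
    val keyPart2: Array[Byte] = PInteger.INSTANCE.toBytes(42)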

I'm currently using Phoenix 4.5.2, therefore I cannot use Hive to
manipulate the Phoenix table, and if possible I want to avoid spawning
an MR job that reads data from CSV
(https://phoenix.apache.org/bulk_dataload.html). Essentially I just want to
do what the CSV loader does with MapReduce, but programmatically with
Spark (since the data I want to persist is already loaded in memory).

Thank you all!






Re: bulk-upsert spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Antonio,

Certainly, a JIRA ticket with a patch would be fantastic.

Thanks!

Josh

On Wed, Sep 28, 2016 at 12:08 PM, Antonio Murgia wrote:

> Thank you very much for your insights Josh. If I decide to develop a small
> Phoenix library that does, through Spark, what the CSV loader does, I'll
> surely write to the mailing list, or open a Jira, or maybe even open a PR,
> right?
>
> Thank you again
>
> #A.M.
>
> On 09/28/2016 05:10 PM, Josh Mahonin wrote:
>
> Hi Antonio,
>
> You're correct, the phoenix-spark output uses the Phoenix Hadoop
> OutputFormat under the hood, which effectively does a parallel, batch JDBC
> upsert. It should scale depending on the number of Spark executors,
> RDD/DataFrame parallelism, and number of HBase RegionServers, though
> admittedly there's a lot of overhead involved.
>
> The CSV bulk loading tool uses MapReduce; it's not integrated with Spark.
> It's likely possible to do so, but it's probably a non-trivial amount of
> work. If you're interested in taking it on, I'd start with looking at the
> following classes:
>
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
> https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
>
> Good luck,
>
> Josh
>
> On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia wrote:
>
>> Hi,
>>
>> I would like to perform a bulk insert to HBase using Apache Phoenix from
>> Spark. I tried using the Apache Spark Phoenix library but, as far as I was
>> able to understand from the code, it looks like it performs a JDBC batch
>> of upserts (am I right?). Instead I want to perform a bulk load like the
>> one described in this blog post
>> (https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advantage of
>> the automatic conversion of Java/Scala types to bytes.
>>
>> I'm currently using Phoenix 4.5.2, therefore I cannot use Hive to
>> manipulate the Phoenix table, and if possible I want to avoid spawning
>> an MR job that reads data from CSV
>> (https://phoenix.apache.org/bulk_dataload.html). Essentially I just want to
>> do what the CSV loader does with MapReduce, but programmatically with Spark
>> (since the data I want to persist is already loaded in memory).
>>
>> Thank you all!
>>
>>
>
>


Re: bulk-upsert spark phoenix

2016-09-28 Thread Antonio Murgia
Thank you very much for your insights Josh. If I decide to develop a 
small Phoenix library that does, through Spark, what the CSV loader 
does, I'll surely write to the mailing list, or open a Jira, or maybe 
even open a PR, right?


Thank you again

#A.M.


On 09/28/2016 05:10 PM, Josh Mahonin wrote:

Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop 
OutputFormat under the hood, which effectively does a parallel, batch 
JDBC upsert. It should scale depending on the number of Spark 
executors, RDD/DataFrame parallelism, and number of HBase 
RegionServers, though admittedly there's a lot of overhead involved.


The CSV bulk loading tool uses MapReduce; it's not integrated with 
Spark. It's likely possible to do so, but it's probably a non-trivial 
amount of work. If you're interested in taking it on, I'd start with 
looking at the following classes:


https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia wrote:


Hi,

I would like to perform a bulk insert to HBase using Apache Phoenix from
Spark. I tried using the Apache Spark Phoenix library but, as far as I was
able to understand from the code, it looks like it performs a JDBC batch
of upserts (am I right?). Instead I want to perform a bulk load like the
one described in this blog post
(https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advantage of
the automatic conversion of Java/Scala types to bytes.

I'm currently using Phoenix 4.5.2, therefore I cannot use Hive to
manipulate the Phoenix table, and if possible I want to avoid spawning
an MR job that reads data from CSV
(https://phoenix.apache.org/bulk_dataload.html). Essentially I just want to
do what the CSV loader does with MapReduce, but programmatically with
Spark (since the data I want to persist is already loaded in memory).

Thank you all!






Re: bulk-upsert spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop
OutputFormat under the hood, which effectively does a parallel, batch JDBC
upsert. It should scale depending on the number of Spark executors,
RDD/DataFrame parallelism, and number of HBase RegionServers, though
admittedly there's a lot of overhead involved.

The CSV bulk loading tool uses MapReduce; it's not integrated with Spark.
It's likely possible to do so, but it's probably a non-trivial amount of
work. If you're interested in taking it on, I'd start with looking at the
following classes:

https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia wrote:

> Hi,
>
> I would like to perform a bulk insert to HBase using Apache Phoenix from
> Spark. I tried using the Apache Spark Phoenix library but, as far as I was
> able to understand from the code, it looks like it performs a JDBC batch
> of upserts (am I right?). Instead I want to perform a bulk load like the
> one described in this blog post
> (https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advantage of
> the automatic conversion of Java/Scala types to bytes.
>
> I'm currently using Phoenix 4.5.2, therefore I cannot use Hive to
> manipulate the Phoenix table, and if possible I want to avoid spawning
> an MR job that reads data from CSV
> (https://phoenix.apache.org/bulk_dataload.html). Essentially I just want to
> do what the CSV loader does with MapReduce, but programmatically with Spark
> (since the data I want to persist is already loaded in memory).
>
> Thank you all!
>
>