How large is each row in this case? Or, better yet, how large is the table in HBase?

You're spreading roughly 7 "clients" across each RegionServer to fetch results (100 executors / 14 RegionServers). So, you should have pretty decent saturation from Spark into HBase.

I'd take a look at the EXPLAIN plan for your SELECT DISTINCT to really understand what Phoenix is doing. For example: are you getting ample saturation of the resources your servers have available (32 cores / 128 GB memory is pretty good)? Validating how busy Spark is actually keeping HBase, and how much time is spent transforming the data, would also be worthwhile. Another point: are you excessively scanning data that you could otherwise skip with a different rowkey structure, e.g. via a skip-scan (which would show up in the EXPLAIN plan)?
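Something like the following would print the plan from any JDBC client (just a sketch; the table/column names and ZooKeeper quorum below are placeholders, and you need the Phoenix client jar on the classpath):

    // Sketch: fetch the Phoenix query plan over JDBC and print it.
    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    val rs = conn.createStatement().executeQuery(
      "EXPLAIN SELECT DISTINCT COL1, COL2, COL3, COL4, COL5, COL6 FROM SOURCE_TABLE")
    while (rs.next()) println(rs.getString(1))  // each row is one line of the plan
    conn.close()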

You may actually find that the built-in UPSERT SELECT logic out-performs the Spark integration, since you aren't actually doing any transformation logic inside of Spark.
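Roughly along these lines (again just a sketch with placeholder names), which keeps the whole DISTINCT-and-load inside Phoenix/HBase without round-tripping the rows through Spark:

    // Sketch: run the DISTINCT + load via a single Phoenix UPSERT SELECT over JDBC.
    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    conn.setAutoCommit(true)  // with autocommit on, Phoenix commits in batches as it scans
    conn.createStatement().executeUpdate(
      "UPSERT INTO TARGET_TABLE (COL1, COL2, COL3, COL4, COL5, COL6) " +
      "SELECT DISTINCT COL1, COL2, COL3, COL4, COL5, COL6 FROM SOURCE_TABLE")
    conn.close()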


On 3/5/18 3:14 PM, Stepan Migunov wrote:
Hi Josh, thank you for response!

Our cluster has 14 nodes (32 cores each / 128 GB memory). The source Phoenix
table contains about 1 billion records (100 columns). We start a Spark job
with about 100 executors. Spark executes a SELECT from the source table
(6 columns with DISTINCT) and writes the output to another Phoenix
table. We expect the target table to contain about 100 million records.
HBase has 14 region servers, and both tables are salted with SALT_BUCKETS=42.
The Spark job runs via YARN.
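A rough sketch of a job like ours with the phoenix-spark data source (table names, column names and the ZooKeeper quorum below are placeholders):

    // Sketch: read from Phoenix, DISTINCT six columns, write back to another Phoenix table.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("phoenix-distinct").getOrCreate()

    val src = spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "SOURCE_TABLE")
      .option("zkUrl", "zk1,zk2,zk3:2181")
      .load()

    src.select("COL1", "COL2", "COL3", "COL4", "COL5", "COL6")
      .distinct()
      .write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)  // phoenix-spark writes are upserts
      .option("table", "TARGET_TABLE")
      .option("zkUrl", "zk1,zk2,zk3:2181")
      .save()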


-----Original Message-----
From: Josh Elser [mailto:els...@apache.org]
Sent: Monday, March 5, 2018 9:14 PM
To: user@phoenix.apache.org
Subject: Re: Phoenix as a source for Spark processing

Hi Stepan,

Can you better ballpark the Phoenix-Spark performance you've seen (e.g.
how much hardware do you have, how many Spark executors did you use, how
many region servers)? Also, what versions of the software are you using?

I don't think there are any firm guidelines on how you can solve this
problem, but you've found the tools available to you:

* You can try Phoenix+Spark to run over the Phoenix tables in place
* You can use Phoenix+Hive to offload the data into Hive for queries

If Phoenix-Spark wasn't fast enough, I'd imagine using the Phoenix-Hive
integration to query the data would be similarly too slow.

It's possible that the bottleneck is something we could fix in the
integration, or in the configuration of Spark and/or Phoenix. We'd need you to
help quantify this better :)

On 3/4/18 6:08 AM, Stepan Migunov wrote:
In our software we need to combine fast interactive access to the data
with quite complex data processing. I know that Phoenix is intended for fast
access, but I hoped I could also use Phoenix as a source for
complex processing with Spark.  Unfortunately, Phoenix + Spark shows
very poor performance. E.g., querying a big table (about a billion records)
with DISTINCT takes about 2 hours. At the same time, this task with a Hive
source takes a few minutes. Is this expected? Does it mean that Phoenix is
absolutely not suitable for batch processing with Spark and I should
duplicate the data to Hive and process it with Hive?