I would guess that Hive will always be capable of out-matching what HBase/Phoenix can do for this type of workload (bulk transformation). That said, I'm not ready to tell you that you can't get the Phoenix-Spark integration performing better. See the other thread where you provide more details.

It's important to remember that Phoenix is designed to shine when you have workloads which require updates to a single row/column. The underlying I/O system in HBase is very different from Hive's in order to serve the random-update use case.
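To make that access pattern concrete, here is a minimal sketch in Scala over the Phoenix JDBC driver; the table, columns, and ZooKeeper quorum are made-up placeholders, not anything from your setup:

    // Hypothetical single-row workload -- the case Phoenix is optimized for.
    // Table/column names and the ZK quorum below are illustrative only.
    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    conn.setAutoCommit(true)

    // Point update of one row/column: ends up as a single HBase Put.
    val upsert = conn.prepareStatement(
      "UPSERT INTO ORDERS (ORDER_ID, STATUS) VALUES (?, ?)")
    upsert.setLong(1, 42L)
    upsert.setString(2, "SHIPPED")
    upsert.executeUpdate()

    // Point lookup by primary key: ends up as a single HBase Get.
    val select = conn.prepareStatement(
      "SELECT STATUS FROM ORDERS WHERE ORDER_ID = ?")
    select.setLong(1, 42L)
    val rs = select.executeQuery()
    while (rs.next()) println(rs.getString("STATUS"))
    conn.close()

That kind of targeted read/write is where the HBase storage model pays off; a bulk rewrite of a whole table exercises exactly the opposite path.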

On 3/7/18 4:08 AM, Stepan Migunov wrote:
Some more details... We have done some simple tests to compare the read/write 
performance of Spark + Hive versus Spark + Phoenix, and now we have the following results:

Copy a table (no transformations, about 800 million records):
Hive (TEZ): 752 sec

Spark:
 From Hive to Hive: 2463 sec
 From Phoenix to Hive: 13310 sec
 From Hive to Phoenix: > 30240 sec

We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13 and Hive 2.1.1.

So it seems that Spark + Phoenix leads to a great performance degradation. Any 
thoughts?
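For reference, a Hive-to-Phoenix copy through the phoenix-spark data source typically looks something like the sketch below (Scala; the table names and zkUrl are assumed placeholders, not the ones from this test). Every output row goes through the region servers as an HBase write rather than a straight file write, which is one likely reason the Hive-to-Phoenix leg is so much slower:

    // Minimal sketch of a Hive -> Phoenix copy with the phoenix-spark connector.
    // "default.big_table", "BIG_TABLE" and the zkUrl are assumed placeholders.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hive-to-phoenix-copy")
      .enableHiveSupport()
      .getOrCreate()

    val src = spark.table("default.big_table")   // source Hive table

    src.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)                  // the connector expects Overwrite
      .option("table", "BIG_TABLE")              // target Phoenix table (must exist)
      .option("zkUrl", "zk-host:2181")
      .save()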

On 2018/03/04 11:08:56, Stepan Migunov <stepan.migu...@firstlinesoftware.com> 
wrote:
In our software we need to combine fast interactive access to the data with 
quite complex data processing. I know that Phoenix is intended for fast access, 
but I hoped that I could also use Phoenix as a source for complex processing 
with Spark. Unfortunately, Phoenix + Spark shows very poor performance. E.g., 
querying a big table (about a billion records) with DISTINCT takes about 2 
hours, while the same task with a Hive source takes a few minutes. Is this 
expected? Does it mean that Phoenix is absolutely not suitable for batch 
processing with Spark, and that I should duplicate the data to Hive and 
process it there?
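For context, the DISTINCT case described above corresponds to something like the following sketch with the phoenix-spark data source; the table name, column name, and zkUrl are illustrative placeholders rather than the actual schema:

    // Sketch of a DISTINCT over a Phoenix-backed table: a full scan of the
    // table followed by a shuffle. Names below are illustrative placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("phoenix-distinct").getOrCreate()

    val df = spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "BIG_TABLE")
      .option("zkUrl", "zk-host:2181")
      .load()

    df.select("SOME_COLUMN").distinct().count()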
