Some more details: we ran some simple tests to compare the read/write 
performance of Spark + Hive and Spark + Phoenix, and got the following results:

Copying a table with no transformations (about 800 million records):
Hive (Tez): 752 sec

Spark:
From Hive to Hive: 2463 sec
From Phoenix to Hive: 13310 sec
From Hive to Phoenix: more than 30240 sec
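
To put these timings on a common scale, here is a rough sketch that converts 
them to approximate throughput and to a slowdown factor relative to the plain 
Hive (Tez) copy. It assumes the table holds exactly 800 million records (the 
real count is only approximate) and treats the Hive to Phoenix time as a lower 
bound; the function and variable names are just illustrative.

```python
# Rough throughput comparison based on the timings reported above.
# Assumption: exactly 800 million records (the real count is approximate).
RECORDS = 800_000_000

# Elapsed seconds per copy path. The Hive -> Phoenix value is only a
# lower bound, since that run took more than 30240 sec.
timings_sec = {
    "Hive (Tez)": 752,
    "Spark: Hive -> Hive": 2463,
    "Spark: Phoenix -> Hive": 13310,
    "Spark: Hive -> Phoenix": 30240,
}


def throughput(records, seconds):
    """Return the average number of records copied per second."""
    return records / seconds


def slowdown(seconds, baseline_seconds):
    """Return how many times longer a run took than the baseline run."""
    return seconds / baseline_seconds


baseline = timings_sec["Hive (Tez)"]
for name, sec in timings_sec.items():
    print(
        f"{name}: {throughput(RECORDS, sec):,.0f} rec/s, "
        f"{slowdown(sec, baseline):.1f}x the Hive (Tez) time"
    )
```

Under these assumptions the Phoenix to Hive copy takes roughly 18 times as long 
as the Hive (Tez) copy, and the Hive to Phoenix copy at least 40 times as long.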

We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, and Hive 2.1.1.

So it seems that Spark + Phoenix leads to a severe performance degradation. 
Any thoughts?

On 2018/03/04 11:08:56, Stepan Migunov <stepan.migu...@firstlinesoftware.com> 
wrote: 
> In our software we need to combine fast interactive access to the data with 
> quite complex data processing. I know that Phoenix is intended for fast 
> access, but I hoped I could also use Phoenix as a source for complex 
> processing with Spark. Unfortunately, Phoenix + Spark shows very poor 
> performance. E.g., querying a big table (about a billion records) with 
> DISTINCT takes about 2 hours. At the same time, this task with a Hive source 
> takes a few minutes. Is this expected? Does it mean that Phoenix is 
> absolutely not suitable for batch processing with Spark and I should 
> duplicate the data to Hive and process it with Hive?
> 
