Some more details: we ran some simple tests to compare the read/write performance of Spark + Hive and Spark + Phoenix, and got the following results.

Copying a table without any transformations (about 800 million records):

Hive (Tez): 752 sec
Spark:
  From Hive to Hive: 2463 sec
  From Phoenix to Hive: 13310 sec
  From Hive to Phoenix: > 30240 sec

We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, and Hive 2.1.1.

So it seems that Spark + Phoenix leads to a severe performance degradation.
Any thoughts?

On 2018/03/04 11:08:56, Stepan Migunov <stepan.migu...@firstlinesoftware.com> wrote:
> In our software we need to combine fast interactive access to the data with
> quite complex data processing. I know that Phoenix is intended for fast access,
> but I hoped that I could also use Phoenix as a source for complex
> processing with Spark. Unfortunately, Phoenix + Spark shows very poor
> performance. E.g., querying a big table (about a billion records) with DISTINCT
> takes about 2 hours. The same task with a Hive source takes a few
> minutes. Is this expected? Does it mean that Phoenix is not suitable at all
> for batch processing with Spark, and that I should duplicate the data to Hive and
> process it with Hive?
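
To put the copy timings reported in this message side by side, here is a small sketch that converts them into approximate throughput and slowdown relative to the Hive (Tez) run. It assumes the row count is exactly 800 million (the message only says "about"), and it treats the Hive to Phoenix figure as a lower bound on the time, since it was reported as "> 30240 sec". The names `RECORDS`, `timings_sec`, and `summarize` are made up for this illustration.

```python
# Timings in seconds, copied from the results reported in this message.
# Assumption: "about 800 million" records is treated as exactly 800,000,000.
RECORDS = 800_000_000

timings_sec = {
    "Hive (Tez)": 752,
    "Spark: Hive -> Hive": 2463,
    "Spark: Phoenix -> Hive": 13310,
    # Reported as "> 30240 sec", so this is a lower bound on the time.
    "Spark: Hive -> Phoenix (lower bound)": 30240,
}


def summarize(timings, records, baseline_sec):
    """Return (name, seconds, records per second, slowdown vs. baseline) tuples."""
    rows = []
    for name, secs in timings.items():
        rows.append((name, secs, records / secs, secs / baseline_sec))
    return rows


summary = summarize(timings_sec, RECORDS, timings_sec["Hive (Tez)"])

for name, secs, rate, slowdown in summary:
    print(f"{name:38s} {secs:>6d} s  {rate:>12,.0f} rec/s  {slowdown:5.1f}x")
```

On these numbers, Spark Hive to Hive is roughly 3.3 times slower than Hive on Tez, Phoenix to Hive roughly 17.7 times slower, and Hive to Phoenix at least 40 times slower.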