Hi,

Phoenix is able to parallelize queries based on the underlying HBase region splits, as well as on its own internal guideposts derived from statistics collection [1].
The phoenix-spark connector exposes those splits to Spark for RDD / DataFrame parallelism. To test this out, you can try running an EXPLAIN SELECT ... query [2] that mimics the DataFrame load to see how many parallel scans will be run, and then compare that to the RDD / DataFrame partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3], the two will be the same. In earlier versions, the partition count will equal the number of regions for that table.

Josh

[1] https://phoenix.apache.org/update_statistics.html
[2] https://phoenix.apache.org/tuning_guide.html
[3] https://issues.apache.org/jira/browse/PHOENIX-3600

On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kana...@gmail.com> wrote:

> Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split
> Phoenix table. How can we ensure the read is parallelized across all
> executors? Would salting/pre-splitting tables help in providing
> parallelism? Appreciate any inputs.
>
> Thanks,
>
> Kanagha
>
> On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kana...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> Per your previous post, it is mentioned: "The phoenix-spark parallelism
>> is based on the splits provided by the Phoenix query planner, and has no
>> requirements on specifying partition columns or upper/lower bounds."
>>
>> Does it depend upon the region splits on the input table for
>> parallelism? Could you please provide more details?
>>
>> Thanks
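For reference, the comparison described above can be sketched in Scala roughly as follows. This is only a sketch, not runnable standalone: it assumes a live HBase/Phoenix cluster, and the table name, columns, and ZooKeeper quorum are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._  // adds phoenixTableAsDataFrame to SQLContext

val sc = new SparkContext(new SparkConf().setAppName("phoenix-parallelism-check"))
val sqlContext = new SQLContext(sc)

// Load a Phoenix table as a DataFrame via the phoenix-spark connector.
// "MY_TABLE", the column list, and the zkUrl are placeholders.
val df = sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE",
  Seq("ID", "COL1"),
  zkUrl = Some("zkhost:2181")
)

// On Phoenix 4.10+ this should match the number of parallel scans reported
// by an equivalent EXPLAIN SELECT ID, COL1 FROM MY_TABLE in sqlline; on
// earlier versions it should equal the table's region count.
println(df.rdd.partitions.size)
```

To get the EXPLAIN side of the comparison, you can run the matching EXPLAIN SELECT ... statement in sqlline.py and count the parallel scans in the query plan.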