RE: How are Dataframes partitioned by default when using spark?

2016-09-29 Thread Long, Xindian
Hi, Josh:

Thanks for the reply. I still have some questions/comments below.

The phoenix-spark integration inherits the underlying splits provided by 
Phoenix, which is a function of the HBase regions, salting and other aspects 
determined by the Phoenix Query Planner.

XD: Is there any documentation on what this function actually is?
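For reference, a minimal sketch (Scala, Spark 1.x-era API; the table name TABLE3 and the zkUrl value are placeholder assumptions, not from the thread) that loads a table through phoenix-spark and prints the resulting partition count, which should match the splits produced by the Phoenix query planner:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("phoenix-splits"))
    val sqlContext = new SQLContext(sc)

    // Load a Phoenix table as a DataFrame via the phoenix-spark data source.
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .options(Map("table" -> "TABLE3", "zkUrl" -> "zkhost:2181"))
      .load()

    // The partition count reflects the splits handed back by the Phoenix
    // query planner (HBase regions, salt buckets, etc.), not a fixed Spark default.
    println(s"partitions: ${df.rdd.partitions.length}")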

Re: #1, as I understand the Spark JDBC connector, it evenly segments the range, 
although it will only work on a numeric column, not a compound row key.

Re: #2, again, as I understand Spark JDBC, I don't believe that's an option, or 
perhaps it will default to only providing 1 partition, i.e., one very large 
query.
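To illustrate the two JDBC cases above, a hedged sketch (Scala again, reusing the sqlContext from the sketch above; the column COL3, the bounds, and the zkhost address are placeholder assumptions): Spark only splits the query when a numeric partition column, its bounds, and a partition count are all supplied; otherwise the whole result set lands in a single partition.

    import java.util.Properties

    val url   = "jdbc:phoenix:zkhost:2181"
    val props = new Properties()
    props.setProperty("driver", "org.apache.phoenix.jdbc.PhoenixDriver")

    // Partitioned read: Spark generates one range predicate on COL3 per partition,
    // each spanning roughly (upperBound - lowerBound) / numPartitions values.
    val partitioned = sqlContext.read.jdbc(
      url, "TABLE3",
      columnName = "COL3",
      lowerBound = 0L,
      upperBound = 1000000L,
      numPartitions = 10,
      connectionProperties = props)

    // Without those options there is nothing to split on, so the result
    // comes back as one partition, i.e. one very large query.
    val unpartitioned = sqlContext.read.jdbc(url, "TABLE3", props)

    println(s"partitioned: ${partitioned.rdd.partitions.length}, " +
      s"unpartitioned: ${unpartitioned.rdd.partitions.length}")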

Re: data-locality, the underlying Phoenix Hadoop Input Format isn't yet 
node-aware. There are some data locality advantages gained by co-locating the 
Spark executors to the RegionServers, but it could be improved. It's worth 
filing a JIRA enhancement ticket for that.
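One way to see the current behaviour, reusing the df from the phoenix-spark sketch above (so still under the same placeholder assumptions), is to ask each partition for its preferred locations. Empty results mean the input format reports no locality hints, so Spark cannot deliberately schedule a task on the RegionServer that holds its data.

    // Inspect locality hints exposed by the underlying input format.
    val rdd = df.rdd
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: preferred hosts = ${rdd.preferredLocations(p)}")
    }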

XD: A JIRA enhancement ticket would be great.

Thanks

Xindian

On Mon, Sep 19, 2016 at 12:48 PM, Long, Xindian wrote:
How are DataFrames/Datasets/RDDs partitioned by default when using Spark, 
assuming the DataFrame/Dataset/RDD is the result of a query like this:

select col1, col2, col3 from table3 where col3 > xxx

I noticed that for HBase, a partitioner partitions the row keys based on region 
splits; can Phoenix do this as well?

I also read that if I use Spark with the Phoenix JDBC interface “it’s only able 
to parallelize queries by partitioning on a numeric column. It also requires a 
known lower bound, upper bound and partition count in order to create split 
queries.”

Question 1: If I specify options like these, is the partitioning based on 
segmenting the range evenly, i.e. each partition gets a row-key range of roughly 
(upperBound - lowerBound) / partitionCount values?

Question 2: If I do not specify any range, or the row key is not a numeric 
column, how is the result partitioned when using JDBC?


If I use the phoenix-spark plugin, it is mentioned that it is able to leverage 
the underlying splits provided by Phoenix. Are there any example scenarios of 
that? E.g. can it partition the resulting DataFrame based on the regions of the 
underlying HBase table, so that Spark can take advantage of the locality of the 
data?

Thanks

Xindian


