Re: How do I convert a data frame to broadcast variable?

2016-11-04 Thread Jain, Nishit
Awesome, thanks Silvio!

From: Silvio Fiorito 
mailto:silvio.fior...@granturing.com>>
Date: Thursday, November 3, 2016 at 12:26 PM
To: "Jain, Nishit" mailto:nja...@underarmour.com>>, 
Denny Lee mailto:denny.g@gmail.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?


Hi Nishit,


Yes the JDBC connector supports predicate pushdown and column pruning. So any 
selection you make on the dataframe will get materialized in the query sent via 
JDBC.


You should be able to verify this by looking at the physical query plan:


val df = sqlContext.jdbc().select($"col1", $"col2")

df.explain(true)


Or if you can easily log queries submitted to your database then you can view 
the specific query.


Thanks,

Silvio


From: Jain, Nishit mailto:nja...@underarmour.com>>
Sent: Thursday, November 3, 2016 12:32:48 PM
To: Denny Lee; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: How do I convert a data frame to broadcast variable?

Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how can I only read few 
columns of a table (not whole table) from JDBC as data frame.
This function from data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand if I want to create a JDBCRdd I can specify a select query 
(instead of full table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

May be if I do df.select(col1, col2)  on data frame created via a table, will 
spark be smart enough to fetch only two columns not entire table?
Any way to test this?


From: Denny Lee mailto:denny.g@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that way you can join to that table presuming its small 
enough to distributed?  Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit 
mailto:nja...@underarmour.com>> wrote:
I have a lookup table in HANA database. I want to create a spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as an data frame and 
convert data frame into broadcast variable?

Thanks,
Nishit


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Silvio Fiorito
Hi Nishit,


Yes the JDBC connector supports predicate pushdown and column pruning. So any 
selection you make on the dataframe will get materialized in the query sent via 
JDBC.


You should be able to verify this by looking at the physical query plan:


val df = sqlContext.jdbc().select($"col1", $"col2")

df.explain(true)


Or if you can easily log queries submitted to your database then you can view 
the specific query.


Thanks,

Silvio


From: Jain, Nishit 
Sent: Thursday, November 3, 2016 12:32:48 PM
To: Denny Lee; user@spark.apache.org
Subject: Re: How do I convert a data frame to broadcast variable?

Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how can I only read few 
columns of a table (not whole table) from JDBC as data frame.
This function from data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand if I want to create a JDBCRdd I can specify a select query 
(instead of full table):
new JdbcRDD(sc: SparkContext, getConnection: () ? Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ? T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

May be if I do df.select(col1, col2)  on data frame created via a table, will 
spark be smart enough to fetch only two columns not entire table?
Any way to test this?


From: Denny Lee mailto:denny.g@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that way you can join to that table presuming its small 
enough to distributed?  Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit 
mailto:nja...@underarmour.com>> wrote:
I have a lookup table in HANA database. I want to create a spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as an data frame and 
convert data frame into broadcast variable?

Thanks,
Nishit


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how can I only read few 
columns of a table (not whole table) from JDBC as data frame.
This function from data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand if I want to create a JDBCRdd I can specify a select query 
(instead of full table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

May be if I do df.select(col1, col2)  on data frame created via a table, will 
spark be smart enough to fetch only two columns not entire table?
Any way to test this?


From: Denny Lee mailto:denny.g@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" mailto:nja...@underarmour.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that way you can join to that table presuming its small 
enough to distributed?  Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit 
mailto:nja...@underarmour.com>> wrote:
I have a lookup table in HANA database. I want to create a spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as an data frame and 
convert data frame into broadcast variable?

Thanks,
Nishit


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a
BroadcastHashJoin so that way you can join to that table presuming its
small enough to distributed?  Here's a handy guide on a BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit  wrote:

> I have a lookup table in HANA database. I want to create a spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as an data frame
> and convert data frame into broadcast variable?
>
> Thanks,
> Nishit
>