Re: How do I convert a data frame to broadcast variable?
Awesome, thanks Silvio!
Re: How do I convert a data frame to broadcast variable?
Hi Nishit,

Yes, the JDBC connector supports predicate pushdown and column pruning, so any selection you make on the DataFrame will be reflected in the query sent via JDBC. You should be able to verify this by looking at the physical query plan:

    val df = sqlContext.jdbc().select($"col1", $"col2")
    df.explain(true)

Or, if you can easily log the queries submitted to your database, you can view the specific query there.

Thanks,
Silvio
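For readers following along, the plan check described above might look roughly like the sketch below. It assumes Spark 2.x with a SparkSession (on 1.x, `sqlContext.read.jdbc` takes the same arguments). The URL, table name, user, environment variable, and driver class are placeholders that do not come from the thread and would need to be replaced with real values.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object JdbcPruningCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-pruning-check").getOrCreate()
    import spark.implicits._

    // Hypothetical connection details (assumptions, not from the thread).
    val url = "jdbc:sap://hana-host:30015"
    val props = new Properties()
    props.setProperty("user", "spark_user")
    props.setProperty("password", sys.env.getOrElse("HANA_PASSWORD", ""))
    props.setProperty("driver", "com.sap.db.jdbc.Driver")

    // Read the table lazily, then keep only two columns.
    val df = spark.read
      .jdbc(url, "LOOKUP_TABLE", props)
      .select($"col1", $"col2")

    // Prints the logical and physical plans; nothing is fetched yet.
    df.explain(true)

    spark.stop()
  }
}
```

If column pruning is applied, the JDBC scan node in the physical plan should list only the selected columns rather than every column of the table. The exact text of the plan differs between Spark versions, so treat this as something to inspect rather than a fixed output.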
Re: How do I convert a data frame to broadcast variable?
Thanks Denny! That does help. I will give that a shot.

Question: if I go this route, how can I read only a few columns of a table (not the whole table) from JDBC as a data frame? This function from the DataFrame reader does not give an option to read only certain columns:

    def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame

On the other hand, if I want to create a JdbcRDD I can specify a select query (instead of the full table):

    new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T = JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

Maybe if I do df.select(col1, col2) on a data frame created from a table, Spark will be smart enough to fetch only those two columns and not the entire table? Is there any way to test this?
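One further option for the question above: the `table` argument of the JDBC reader accepts anything that is valid in a FROM clause, so a parenthesized subquery can stand in for the table name. The sketch below shows one way that might be wrapped. The object and helper names (`JdbcColumnSubset`, `asSubquery`, `readColumns`) are made up for illustration, and the `AS` alias syntax is an assumption that may need adjusting for a particular database.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcColumnSubset {
  // Builds a parenthesized subquery that can be passed where the JDBC
  // reader expects a table name. Identifiers are interpolated as-is, so
  // this is only suitable for trusted column and table names.
  def asSubquery(columns: Seq[String], table: String, alias: String = "t"): String =
    s"(SELECT ${columns.mkString(", ")} FROM $table) AS $alias"

  // Reads only the requested columns by handing the subquery to the reader.
  def readColumns(spark: SparkSession, url: String, props: Properties,
                  table: String, columns: Seq[String]): DataFrame =
    spark.read.jdbc(url, asSubquery(columns, table), props)
}
```

A call such as `JdbcColumnSubset.readColumns(spark, url, props, "LOOKUP_TABLE", Seq("col1", "col2"))` would then make the database evaluate the column list itself, independent of any pruning Spark performs.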
Re: How do I convert a data frame to broadcast variable?
If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin, so that you can join to that table, presuming it's small enough to be distributed?

Here's a handy guide on a BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!

On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit wrote:
> I have a lookup table in a HANA database. I want to create a Spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as a data frame
> and convert the data frame into a broadcast variable?
>
> Thanks,
> Nishit
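The two approaches raised in this exchange, a broadcast join on the DataFrame and an explicit broadcast variable, can be compared in a small sketch. It assumes Spark 2.x running locally and uses tiny in-memory DataFrames as stand-ins for the HANA lookup table and the data being enriched; the column names and values are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastLookup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-lookup")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the lookup table and the larger data set (made-up data).
    val lookup = Seq((1, "red"), (2, "green")).toDF("id", "color")
    val facts = Seq((1, 10.0), (2, 20.0), (1, 30.0)).toDF("id", "amount")

    // Option 1: hint that the lookup side should be broadcast for the join.
    val joined = facts.join(broadcast(lookup), "id")
    joined.explain() // the plan can be inspected for a broadcast hash join
    joined.show()

    // Option 2: collect the lookup to the driver and broadcast it as a Map.
    // Only sensible when the table comfortably fits in driver memory.
    val lookupMap: Map[Int, String] = lookup.as[(Int, String)].collect().toMap
    val bc = spark.sparkContext.broadcast(lookupMap)
    val enriched = facts.as[(Int, Double)]
      .map { case (id, amount) => (id, amount, bc.value.getOrElse(id, "unknown")) }
      .toDF("id", "amount", "color")
    enriched.show()

    spark.stop()
  }
}
```

The join hint keeps everything in the DataFrame API and lets the optimizer handle the distribution, while the explicit broadcast variable is mainly useful when the lookup has to be consulted inside arbitrary functions.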