Thanks Dean. 

I gather that if I wanted to do the whole thing through FP with little
or no use of SQL, then the first step would be getting the data set from
Hive, i.e.:

val rs = HiveContext.sql("""SELECT t.calendar_month_desc,
c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM smallsales s, times t, channels c
WHERE s.time_id = t.time_id
AND s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
""") 

Could I use the DataFrame API to do the above SQL joins as well, given
that what I am doing above can presumably also be done in DataFrames
without SQL? Is that possible, or do I have to use some form of SQL?
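
Something like the following sketch, perhaps (assuming the Spark 1.x
DataFrame API, and the table and column names from the query above)?

import org.apache.spark.sql.functions.sum

// Read the Hive tables as DataFrames (assumes the right database is in use)
val sales    = HiveContext.table("smallsales")
val times    = HiveContext.table("times")
val channels = HiveContext.table("channels")

// The same joins and GROUP BY, expressed as DataFrame combinators
val rs = sales
  .join(times, sales("time_id") === times("time_id"))
  .join(channels, sales("channel_id") === channels("channel_id"))
  .groupBy(times("calendar_month_desc"), channels("channel_desc"))
  .agg(sum(sales("amount_sold")).as("TotalSales"))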

The rest I can do simply using DFs. 

Thanks 

On 23/02/2016 00:26, Dean Wampler wrote: 

> Kevin gave you the answer you need, but I'd like to comment on your subject 
> line. SQL is a limited form of FP. Sure, there are no anonymous functions and 
> other limitations, but it's declarative, like good FP programs should be, and 
> it offers an important subset of the operators ("combinators") you want. 
> 
> Also, on a practical note, use the DataFrame API whenever you can, rather 
> than dropping down to the RDD API, because the DataFrame API is far more 
> performant. It's a classic case where restricting your options enables more 
> aggressive optimizations behind the scenes. Michael Armbrust's talk at Spark 
> Summit East nicely made this point. 
> http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming [3]
> 
> dean 
> 
> Dean Wampler, Ph.D. 
> Author: Programming Scala, 2nd Edition [4] (O'Reilly)
> 
> Typesafe [5]
> @deanwampler [6] 
> http://polyglotprogramming.com [7] 
> 
> On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott <kevin.r.mell...@gmail.com> 
> wrote:
> 
> In your example, the _rs_ instance should be a DataFrame object. In other 
> words, the result of _HiveContext.sql_ is a DataFrame that you can manipulate 
> using _filter_, _map_, etc. 
> 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext [8]
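> 
> For example, a quick sketch of working on that result with no further SQL 
> (a hypothetical filter; column names taken from your query): 
> 
> // keep only rows with TotalSales above one million, using DataFrame ops
> rs.filter(rs("TotalSales") > 1000000)
>   .select("calendar_month_desc", "channel_desc", "TotalSales")
>   .show()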
> 
> On Mon, Feb 22, 2016 at 5:16 PM, Mich Talebzadeh 
> <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> 
> Hi, 
> 
> I have data stored in Hive tables on which I want to do some simple manipulation. 
> 
> Currently in Spark I do the following: I get the result set from the Hive 
> tables using SQL and register it as a temporary table in Spark. 
> 
> Ideally I would then get the result set into a DataFrame and work on the 
> DataFrame to slice and dice the data using functional programming with 
> filter, map, split, etc. 
> 
> I wanted to get some ideas on how to go about it. 
> 
> thanks 
> 
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) 
> 
> HiveContext.sql("use oraclehadoop")
> val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc, 
> SUM(s.amount_sold) AS TotalSales
> FROM smallsales s, times t, channels c
> WHERE s.time_id = t.time_id
> AND s.channel_id = c.channel_id
> GROUP BY t.calendar_month_desc, c.channel_desc
> """)
> rs.registerTempTable("tmp") 
> 
> HiveContext.sql("""
> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
> from tmp
> ORDER BY MONTH, CHANNEL
> """).collect.foreach(println)
> HiveContext.sql("""
> SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
> FROM tmp
> GROUP BY channel_desc
> ORDER BY SALES DESC
> """).collect.foreach(println) 
> 
> -- 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw [1]
> 
> http://talebzadehmich.wordpress.com [2]
> 

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


Links:
------
[1] https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
[2] http://talebzadehmich.wordpress.com
[3] http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming
[4] http://shop.oreilly.com/product/0636920033073.do
[5] http://typesafe.com
[6] http://twitter.com/deanwampler
[7] http://polyglotprogramming.com
[8] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
