That’s pretty neat! So I guess if you start with an RDD of objects, you’d first do something like RDD.map(lambda x: Record(x['field_1'], x['field_2'], ...)) in order to register it as a table, and from there run your aggregates. Very nice.
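Concretely, I imagine it would look something like this in Scala against the 1.0 API (a rough, untested sketch; it assumes the raw records were already parsed into Maps of strings upstream, and reuses your MyRecord case class):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings in the implicit RDD-to-SchemaRDD conversion

case class MyRecord(country: String, name: String, age: Int, hits: Long)

// Hypothetical input: an RDD of already-parsed key/value maps.
val raw = sc.parallelize(Seq(
  Map("country" -> "USA", "name" -> "Franklin", "age" -> "24", "hits" -> "224"),
  Map("country" -> "USA", "name" -> "Bob", "age" -> "55", "hits" -> "108"),
  Map("country" -> "France", "name" -> "Remi", "age" -> "33", "hits" -> "72")))

// Map each raw record into the case class so SparkSQL can infer a schema.
val records = raw.map(m =>
  MyRecord(m("country"), m("name"), m("age").toInt, m("hits").toLong))

records.registerAsTable("MyRecords")
val results = sql(
  "SELECT t.country, AVG(t.age), SUM(t.hits) FROM MyRecords t GROUP BY t.country").collect()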
On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> This looks like a job for SparkSQL!
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> case class MyRecord(country: String, name: String, age: Int, hits: Long)
> val data = sc.parallelize(Array(
>   MyRecord("USA", "Franklin", 24, 234),
>   MyRecord("USA", "Bob", 55, 108),
>   MyRecord("France", "Remi", 33, 72)))
> data.registerAsTable("MyRecords")
> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits)
>   FROM MyRecords t GROUP BY t.country""").collect
>
> Now "results" contains:
>
> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>
> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
>
>> Hi Nick,
>>
>> Instead of using reduceByKey(), you might want to look into using
>> aggregateByKey(), which allows you to return a different value type U
>> instead of the input value type V for each input tuple (K, V). You can
>> define U to be a datatype that holds both the average and the total, and
>> have seqOp update both fields of U in a single pass.
>>
>> Hope this makes sense,
>> Doris
>>
>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> The following is a simplified example of what I am trying to accomplish.
>>>
>>> Say I have an RDD of objects like this:
>>>
>>> {"country": "USA", "name": "Franklin", "age": 24, "hits": 224}
>>> {"country": "USA", "name": "Bob", "age": 55, "hits": 108}
>>> {"country": "France", "name": "Remi", "age": 33, "hits": 72}
>>>
>>> I want to find the average age and total number of hits per country.
>>> Ideally, I would like to scan the data once and perform both
>>> aggregations simultaneously.
>>>
>>> What is a good approach to doing this?
>>>
>>> I’m thinking that we’d want to keyBy(country), and then somehow
>>> reduceByKey(). The problem is, I don’t know how to approach writing a
>>> function that can be passed to reduceByKey() and that will track a
>>> running average and total simultaneously.
>>>
>>> Nick
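For comparison, here is what I imagine Doris's aggregateByKey() suggestion would look like (again an untested sketch, reusing Evan's MyRecord and data): carry a (sum of ages, count, sum of hits) triple per country in one pass, and derive the average at the end.

import org.apache.spark.SparkContext._  // pair-RDD functions (automatic in the shell)

case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Seq(
  MyRecord("USA", "Franklin", 24, 234),
  MyRecord("USA", "Bob", 55, 108),
  MyRecord("France", "Remi", 33, 72)))

// Key by country, then fold each record into a (ageSum, count, hitsTotal)
// accumulator; combOp merges the per-partition accumulators.
val zero = (0L, 0L, 0L)
val perCountry = data.map(r => (r.country, r)).aggregateByKey(zero)(
  (acc, r) => (acc._1 + r.age, acc._2 + 1, acc._3 + r.hits),  // seqOp
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))          // combOp

// Derive the average age from the running sum and count.
val results = perCountry.mapValues { case (ageSum, n, hits) =>
  (ageSum.toDouble / n, hits)
}.collect()

// Should give: Array((France,(33.0,72)), (USA,(39.5,342)))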