That’s pretty neat! So I guess if you start with an RDD of objects, you’d
first do something like RDD.map(lambda x: Record(x['field_1'],
x['field_2'], ...)) in order to register it as a table, and from there run
your aggregates. Very nice.
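
Something like this rough Scala sketch, I imagine (untested; it assumes the raw
records have already been parsed into Maps, and it reuses the MyRecord case class
and table name from your example, so the Map keys here are just placeholders):

// assumes: val raw: RDD[Map[String, String]] = ... (the already-parsed records)
val records = raw.map(m =>
  MyRecord(m("country"), m("name"), m("age").toInt, m("hits").toLong))
records.registerAsTable("MyRecords")  // works via the implicits from `import sqlContext._`
val avgAges = sql("SELECT country, AVG(age) FROM MyRecords GROUP BY country").collect()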


On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

> This looks like a job for SparkSQL!
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> case class MyRecord(country: String, name: String, age: Int, hits: Long)
> val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
> MyRecord("USA", "Bob", 55, 108), MyRecord("France", "Remi", 33, 72)))
> data.registerAsTable("MyRecords")
> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits) FROM
> MyRecords t GROUP BY t.country""").collect
>
> Now "results" contains:
>
> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>
>
> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
>
>> Hi Nick,
>>
>> Instead of using reduceByKey(), you might want to look into using
>> aggregateByKey(), which lets you return a result type U that is different
>> from the input value type V of each input tuple (K, V). You can define U to
>> be a datatype that holds a running sum and count for the average as well as
>> the running total of hits, and have seqOp update all of its fields in a
>> single pass.
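>>
>> For example, a rough, untested sketch, assuming the records end up in a case
>> class with country, age, and hits fields (the Record name, the field names,
>> and the `records` RDD below are placeholders for however the data is actually
>> represented):
>>
>> case class Record(country: String, name: String, age: Int, hits: Long)
>> // assumes: val records: RDD[Record] = ...
>> // U = (ageSum, count, hitTotal); the zero value is the identity for all three
>> val aggregated = records.keyBy(_.country).aggregateByKey((0L, 0L, 0L))(
>>   (u, r) => (u._1 + r.age, u._2 + 1, u._3 + r.hits),    // seqOp: fold one record into U
>>   (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))    // combOp: merge partial Us
>> val perCountry = aggregated.mapValues { case (ageSum, n, hits) =>
>>   (ageSum.toDouble / n, hits) }                         // (average age, total hits)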
>>
>> Hope this makes sense,
>> Doris
>>
>>
>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>>> The following is a simplified example of what I am trying to accomplish.
>>>
>>> Say I have an RDD of objects like this:
>>>
>>> {
>>>     "country": "USA",
>>>     "name": "Franklin",
>>>     "age": 24,
>>>     "hits": 224
>>> }
>>> {
>>>     "country": "USA",
>>>     "name": "Bob",
>>>     "age": 55,
>>>     "hits": 108
>>> }
>>> {
>>>     "country": "France",
>>>     "name": "Remi",
>>>     "age": 33,
>>>     "hits": 72
>>> }
>>> I want to find the average age and total number of hits per country.
>>> Ideally, I would like to scan the data once and perform both aggregations
>>> simultaneously.
>>>
>>> What is a good approach to doing this?
>>>
>>> I’m thinking that we’d want to keyBy(country), and then somehow
>>> reduceByKey(). The problem is, I don’t know how to approach writing a
>>> function that can be passed to reduceByKey() and that will track a
>>> running average and total simultaneously.
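>>>
>>> The closest I've come is reducing over (ageSum, count, hits) tuples, something
>>> like the rough, untested sketch below (assuming the objects end up in some
>>> Record class with country, age, and hits fields), but I don't know whether
>>> that is the right pattern:
>>>
>>> // assumes: val records: RDD[Record] = ... with country, age, and hits fields
>>> val partial = records
>>>   .keyBy(_.country)
>>>   .mapValues(r => (r.age.toLong, 1L, r.hits))  // (ageSum, count, hitTotal)
>>>   .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
>>> val perCountry = partial.mapValues { case (ageSum, n, hits) =>
>>>   (ageSum.toDouble / n, hits) }                // (average age, total hits)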
>>>
>>> Nick
>>>
>>
>>
>
