This is exciting! For anyone else reading along, here is the relevant "alpha" doc for this feature: <http://yhuai.github.io/site/sql-programming-guide.html#json-datasets>. I'm going to try it out.

Will this be released with 1.1.0?
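In the meantime, from a quick read of that doc, usage should look roughly like this (a sketch on my part, untested; the file path and table name are made up, and each line of the input file has to hold one complete JSON object):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Schema is inferred automatically from the JSON records.
val people = sqlContext.jsonFile("people.json")  // hypothetical path
people.printSchema()
people.registerAsTable("people")

val results = sql(
  "SELECT country, AVG(age), SUM(hits) FROM people GROUP BY country").collect()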
On Wed, Jun 18, 2014 at 8:31 PM, Zongheng Yang <zonghen...@gmail.com> wrote:
> If your input data is JSON, you can also try out the recently merged-in
> initial JSON support:
>
> https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
>
> On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > That’s pretty neat! So I guess if you start with an RDD of objects,
> > you’d first do something like RDD.map(lambda x: Record(x['field_1'],
> > x['field_2'], ...)) in order to register it as a table, and from there
> > run your aggregates. Very nice.
> >
> > On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks <evan.spa...@gmail.com>
> > wrote:
> >> This looks like a job for SparkSQL!
> >>
> >> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> >> import sqlContext._
> >> case class MyRecord(country: String, name: String, age: Int, hits: Long)
> >> val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
> >>   MyRecord("USA", "Bob", 55, 108), MyRecord("France", "Remi", 33, 72)))
> >> data.registerAsTable("MyRecords")
> >> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits) FROM
> >>   MyRecords t GROUP BY t.country""").collect
> >>
> >> Now "results" contains:
> >>
> >> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
> >>
> >> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
> >>> Hi Nick,
> >>>
> >>> Instead of using reduceByKey(), you might want to look into using
> >>> aggregateByKey(), which allows you to return a different value type U
> >>> instead of the input value type V for each input tuple (K, V). You can
> >>> define U to be a datatype that holds both the average and total and
> >>> have seqOp update both fields of U in a single pass.
> >>>
> >>> Hope this makes sense,
> >>> Doris
> >>>
> >>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
> >>> <nicholas.cham...@gmail.com> wrote:
> >>>> The following is a simplified example of what I am trying to
> >>>> accomplish.
> >>>>
> >>>> Say I have an RDD of objects like this:
> >>>>
> >>>> {
> >>>>   "country": "USA",
> >>>>   "name": "Franklin",
> >>>>   "age": 24,
> >>>>   "hits": 224
> >>>> }
> >>>> {
> >>>>   "country": "USA",
> >>>>   "name": "Bob",
> >>>>   "age": 55,
> >>>>   "hits": 108
> >>>> }
> >>>> {
> >>>>   "country": "France",
> >>>>   "name": "Remi",
> >>>>   "age": 33,
> >>>>   "hits": 72
> >>>> }
> >>>>
> >>>> I want to find the average age and total number of hits per country.
> >>>> Ideally, I would like to scan the data once and perform both
> >>>> aggregations simultaneously.
> >>>>
> >>>> What is a good approach to doing this?
> >>>>
> >>>> I’m thinking that we’d want to keyBy(country), and then somehow
> >>>> reduceByKey(). The problem is, I don’t know how to approach writing a
> >>>> function that can be passed to reduceByKey() and that will track a
> >>>> running average and total simultaneously.
> >>>>
> >>>> Nick
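To make Doris's aggregateByKey() suggestion above concrete, here is a rough, untested sketch of the one-pass version. The Person class, the (ageSum, hitSum, count) accumulator, and all names are mine, not from the thread; this assumes the spark-shell's sc and a build recent enough to have aggregateByKey, which is quite new:

case class Person(country: String, name: String, age: Int, hits: Long)

val people = sc.parallelize(Seq(
  Person("USA", "Franklin", 24, 224),
  Person("USA", "Bob", 55, 108),
  Person("France", "Remi", 33, 72)))

// Accumulator U = (ageSum, hitSum, count); start from all zeros.
val perCountry = people
  .map(p => (p.country, p))
  .aggregateByKey((0L, 0L, 0L))(
    // seqOp: fold one record into the partition-local accumulator
    (acc, p) => (acc._1 + p.age, acc._2 + p.hits, acc._3 + 1),
    // combOp: merge accumulators coming from different partitions
    (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

// Only at the very end is the running sum turned into an average,
// so the data itself is scanned exactly once.
val results = perCountry.mapValues { case (ageSum, hitSum, n) =>
  (ageSum.toDouble / n, hitSum)
}

results.collect()
// e.g. Array((France,(33.0,72)), (USA,(39.5,332)))

The same shape works with reduceByKey() if you first map each record to the (ageSum, hitSum, count) triple; aggregateByKey() just skips that extra materialization by letting the accumulator type differ from the value type.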