This is exciting! For anyone else reading along, here is the relevant "alpha" doc for this feature: <http://yhuai.github.io/site/sql-programming-guide.html#json-datasets>. I'm going to try it out.

Will this be released with 1.1.0?
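In the meantime, from a quick read of that doc, usage should look roughly like this (a sketch on my part, untested; the file path and table name are made up, and each line of the input file has to hold one complete JSON object):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Schema is inferred automatically from the JSON records.
val people = sqlContext.jsonFile("people.json")  // hypothetical path
people.printSchema()
people.registerAsTable("people")

val results = sql(
  "SELECT country, AVG(age), SUM(hits) FROM people GROUP BY country").collect()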
On Wed, Jun 18, 2014 at 8:31 PM, Zongheng Yang <zonghen...@gmail.com> wrote:
> If your input data is JSON, you can also try out the recently merged-in
> initial JSON support:
>
> https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
>
> On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > That’s pretty neat! So I guess if you start with an RDD of objects,
> > you’d first do something like RDD.map(lambda x: Record(x['field_1'],
> > x['field_2'], ...)) in order to register it as a table, and from there
> > run your aggregates. Very nice.
> >
> > On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks <evan.spa...@gmail.com>
> > wrote:
> >> This looks like a job for SparkSQL!
> >>
> >> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> >> import sqlContext._
> >> case class MyRecord(country: String, name: String, age: Int, hits: Long)
> >> val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
> >>   MyRecord("USA", "Bob", 55, 108), MyRecord("France", "Remi", 33, 72)))
> >> data.registerAsTable("MyRecords")
> >> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits) FROM
> >>   MyRecords t GROUP BY t.country""").collect
> >>
> >> Now "results" contains:
> >>
> >> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
> >>
> >> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
> >>> Hi Nick,
> >>>
> >>> Instead of using reduceByKey(), you might want to look into using
> >>> aggregateByKey(), which allows you to return a different value type U
> >>> instead of the input value type V for each input tuple (K, V). You can
> >>> define U to be a datatype that holds both the average and total and
> >>> have seqOp update both fields of U in a single pass.
> >>>
> >>> Hope this makes sense,
> >>> Doris
> >>>
> >>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
> >>> <nicholas.cham...@gmail.com> wrote:
> >>>> The following is a simplified example of what I am trying to
> >>>> accomplish.
> >>>>
> >>>> Say I have an RDD of objects like this:
> >>>>
> >>>> {
> >>>>   "country": "USA",
> >>>>   "name": "Franklin",
> >>>>   "age": 24,
> >>>>   "hits": 224
> >>>> }
> >>>> {
> >>>>   "country": "USA",
> >>>>   "name": "Bob",
> >>>>   "age": 55,
> >>>>   "hits": 108
> >>>> }
> >>>> {
> >>>>   "country": "France",
> >>>>   "name": "Remi",
> >>>>   "age": 33,
> >>>>   "hits": 72
> >>>> }
> >>>>
> >>>> I want to find the average age and total number of hits per country.
> >>>> Ideally, I would like to scan the data once and perform both
> >>>> aggregations simultaneously.
> >>>>
> >>>> What is a good approach to doing this?
> >>>>
> >>>> I’m thinking that we’d want to keyBy(country), and then somehow
> >>>> reduceByKey(). The problem is, I don’t know how to approach writing a
> >>>> function that can be passed to reduceByKey() and that will track a
> >>>> running average and total simultaneously.
> >>>>
> >>>> Nick
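To make Doris's aggregateByKey() suggestion above concrete, here is a rough, untested sketch of the one-pass version. The Person class, the (ageSum, hitSum, count) accumulator, and all names are mine, not from the thread; this assumes the spark-shell's sc and a build recent enough to have aggregateByKey, which is quite new:

case class Person(country: String, name: String, age: Int, hits: Long)

val people = sc.parallelize(Seq(
  Person("USA", "Franklin", 24, 224),
  Person("USA", "Bob", 55, 108),
  Person("France", "Remi", 33, 72)))

// Accumulator U = (ageSum, hitSum, count); start from all zeros.
val perCountry = people
  .map(p => (p.country, p))
  .aggregateByKey((0L, 0L, 0L))(
    // seqOp: fold one record into the partition-local accumulator
    (acc, p) => (acc._1 + p.age, acc._2 + p.hits, acc._3 + 1),
    // combOp: merge accumulators coming from different partitions
    (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

// Only at the very end is the running sum turned into an average,
// so the data itself is scanned exactly once.
val results = perCountry.mapValues { case (ageSum, hitSum, n) =>
  (ageSum.toDouble / n, hitSum)
}

results.collect()
// e.g. Array((France,(33.0,72)), (USA,(39.5,332)))

The same shape works with reduceByKey() if you first map each record to the (ageSum, hitSum, count) triple; aggregateByKey() just skips that extra materialization by letting the accumulator type differ from the value type.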