Which version are you using? It worked for me in 2.0.0-preview, as follows:

Spark context available as 'sc' (master = local[*], app id = local-1466234043659).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> ds.groupBy($"_1").count.select($"_1", $"count").show
+---+-----+
| _1|count|
+---+-----+
|  1|    1|
|  2|    1|
+---+-----+

On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> I went ahead and downloaded/compiled Spark 2.0 to try your code snippet,
> Takeshi. Unfortunately, it doesn't compile.
>
> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>
> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
> <console>:28: error: type mismatch;
>  found   : org.apache.spark.sql.ColumnName
>  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row, Long),?]
>        ds.groupBy($"_1").count.select($"_1", $"count").show
>                                       ^
>
> I also gave Xinh's suggestion a try using the code snippet below
> (partially from the Spark docs):
>
> scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2), Person("Pedro", 24), Person("Bob", 42)).toDS()
> scala> ds.groupBy(_.name).count.select($"name".as[String]).show
> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
> scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
> scala> ds.groupBy($"name").count.select($"_1".as[String]).show
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns: [];
>
> It looks like the columns are empty for some reason; the code below works
> fine for the simple aggregate:
>
> scala> ds.groupBy(_.name).count.show
>
> It would be great to see an idiomatic example of using aggregates like
> these mixed with spark.sql.functions.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>
>> Thanks Xinh and Takeshi,
>>
>> I am trying to avoid map, since my impression is that it uses a Scala
>> closure and so is not optimized as well as column-wise operations are.
>>
>> Looks like the $ notation is the way to go; thanks for the help. Is there
>> an explanation of how this works? I imagine it is a method/function with
>> its name defined as $ in Scala?
>>
>> Lastly, are there preliminary Spark 2.0 docs? If there isn't a good
>> description/guide for using this syntax, I would be willing to contribute
>> some documentation.
>>
>> Pedro
>>
>> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In 2.0, you can say:
>>>
>>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>>
>>> // maropu
>>>
>>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>>>
>>>> Hi Pedro,
>>>>
>>>> In 1.6.1, you can do:
>>>>
>>>> >> ds.groupBy(_.uid).count().map(_._1)
>>>> or
>>>> >> ds.groupBy(_.uid).count().select($"value".as[String])
>>>>
>>>> It doesn't have the exact same syntax as for DataFrame:
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>>>
>>>> It might be different in 2.0.
>>>>
>>>> Xinh
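For reference on Pedro's question above about how the $ notation works: it is a Scala string interpolator, i.e. a method literally named $ that an implicit class adds to StringContext, so $"count" desugars to StringContext("count").$(). Spark's own version is the StringToColumn implicit in org.apache.spark.sql.SQLImplicits (pulled in by importing spark.implicits._ in 2.0, or sqlContext.implicits._ in 1.6.x), and it returns a ColumnName. Below is a minimal, Spark-free sketch of the same mechanism, pasteable into any Scala REPL; the names MyColumn and ColumnSyntax are illustrative only, not Spark's:

// A toy stand-in for Spark's Column, just to show the mechanism.
case class MyColumn(name: String)

object ColumnSyntax {
  // $"foo" desugars to StringContext("foo").$(), so this method is what runs.
  implicit class StringToColumn(val sc: StringContext) {
    def $(args: Any*): MyColumn = MyColumn(sc.s(args: _*))
  }
}

import ColumnSyntax._

val c1 = $"count"            // MyColumn(count)
val field = "uid"
val c2 = $"prefix_$field"    // MyColumn(prefix_uid): normal interpolation still works

The result of Spark's $"..." is a plain, untyped Column (a ColumnName, specifically), which is why the typed Dataset select in the error above asks for a TypedColumn instead; calling .as[T] on a column, as in $"value".as[String], is what produces a TypedColumn.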
>>>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am working on using Datasets in 1.6.1, and eventually 2.0 when it's
>>>>> released.
>>>>>
>>>>> I am running the aggregate code below on a dataset whose rows have a
>>>>> field uid:
>>>>>
>>>>> ds.groupBy(_.uid).count()
>>>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2: bigint]
>>>>>
>>>>> This works as expected; however, attempts to run select statements
>>>>> afterwards fail:
>>>>>
>>>>> ds.groupBy(_.uid).count().select(_._1)
>>>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>>>> //        ds.groupBy(_.uid).count().select(_._1)
>>>>>
>>>>> I have tried several variants, but nothing seems to work. Below is the
>>>>> equivalent DataFrame code, which works as expected:
>>>>>
>>>>> df.groupBy("uid").count().select("uid")
>>>>>
>>>>> Thanks!
>>>>> --
>>>>> Pedro Rodriguez
>>>>> PhD Student in Distributed Machine Learning | CU Boulder
>>>>> UC Berkeley AMPLab Alumni
>>>>>
>>>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>>>> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience

--
---
Takeshi Yamamuro
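On the request above for an idiomatic example mixing these aggregates with spark.sql.functions: the snippet below is one possible shape for it, written against the Spark 2.0.0 API as a hedged sketch rather than anything from the thread; the Person case class, column aliases, and data are illustrative assumptions. The untyped path goes through groupBy on a column and the functions package; the typed path keeps groupByKey and renames the resulting columns positionally so that $"..." selects resolve by name.

import org.apache.spark.sql.functions._
import spark.implicits._   // already in scope in spark-shell; needed explicitly in a compiled app

case class Person(name: String, age: Long)

val people = Seq(Person("Andy", 32), Person("Andy", 2),
                 Person("Pedro", 24), Person("Bob", 42)).toDS()

// Untyped (DataFrame-style) path: aggregate with spark.sql.functions,
// then select the named columns with $"...".
people.groupBy($"name")
  .agg(count(lit(1)).as("count"), avg($"age").as("avg_age"))
  .select($"name", $"count", $"avg_age")
  .show()

// Typed path: groupByKey keeps the Dataset API; toDF renames the
// (key, count) result columns positionally so they can be selected by name.
people.groupByKey(_.name)
  .count()
  .toDF("name", "count")
  .select($"name", $"count")
  .show()

If the rename is not needed, people.groupByKey(_.name).count().map(_._1) stays fully typed, which is essentially Xinh's 1.6.1 suggestion (where the method is still called groupBy rather than groupByKey).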