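The variable "x" is a DataFrame, which in recent versions is just an alias for Dataset[Row]. You can do your map operation with x.map { case Row(f1: String, f2: Int, ....) => [your code] }. Note the curly braces: a case clause is not valid inside plain parentheses. Row comes from org.apache.spark.sql, and f1 and f2 stand for the columns of your dataset, each with its type. Inside the body you can use f1 and f2 as ordinary variables to write your map function. Keep in mind that a CSV read without a schema (or the inferSchema option) gives you only String columns, so the types in the pattern have to match what was actually read.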
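To make that concrete, here is a minimal sketch of two ways to map over the CSV. It is not tested against your data: the file path, the two-column layout (a string and an integer, no header row), and the names Person, name and age are all assumptions made up for the example. It also moves the case class out of main, since as far as I know a case class declared inside a method is a common cause of the "Unable to find encoder" error.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical record type for an assumed two-column CSV (name, age).
// Declared at the top level rather than inside main, so that
// spark.implicits._ can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetMapExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Dataset map example")
      .getOrCreate()

    // Brings the implicit encoders for String, case classes, etc. into scope.
    import spark.implicits._

    // Without a schema every CSV column is a String named _c0, _c1, ...
    // inferSchema asks Spark to guess the column types from the data.
    // The path is a placeholder.
    val df = spark.read
      .option("inferSchema", "true")
      .csv("/home/user/data.csv")

    // Option 1: pattern match on Row. A row that does not fit the pattern
    // (for example a null value) fails with a MatchError at runtime.
    val labels: Dataset[String] = df.map {
      case Row(name: String, age: Int) => s"$name is $age"
    }

    // Option 2: rename the columns to match the case class fields,
    // convert to a typed Dataset, then map over plain Scala objects.
    val people: Dataset[Person] = df.toDF("name", "age").as[Person]
    val older: Dataset[Person] = people.map(p => p.copy(age = p.age + 1))

    labels.show()
    older.show()

    spark.stop()
  }
}
```

The second form is usually easier to work with, because the compiler checks the field names and types for you instead of leaving it to a runtime pattern match.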
On Thu, Mar 23, 2017 at 2:58 AM, Keith Chapman <[email protected]> wrote:

> Thanks for the advice Diego, that was very helpful. How could I read the
> csv as a dataset though? I need to do a map operation over the dataset, I
> just coded up an example to illustrate the issue
>
> On Mar 22, 2017 6:43 PM, "Diego Fanesi" <[email protected]> wrote:
>
>> You are using spark as a library but it is much more than that. The book
>> "learning Spark" is very well done and it helped me a lot starting with
>> spark. Maybe you should start from there.
>>
>> Those are the issues in your code:
>>
>> Basically, you generally don't execute spark code like that. You could
>> but it is not officially supported and many functions don't work in that
>> way. You should start your local cluster made of master and single worker,
>> then make a jar with your code and use spark-submit to send it to the
>> cluster.
>>
>> You generally never use args because spark is a multiprocess,
>> multi-thread application so args will not be available everywhere.
>>
>> All contexts have been merged into the same context in the last versions
>> of spark. so you will need to do something like this:
>>
>> import org.apache.spark.sql.{DataFrame, SparkSession}
>>
>> object DatasetTest{
>>
>>   val spark: SparkSession = SparkSession
>>     .builder() .master("local[8]")
>>     .appName("Spark basic example").getOrCreate()
>>
>>   import spark.implicits._
>>
>>   def main(Args: Array[String]) {
>>
>>     var x = spark.read.format("csv").load("/home/user/data.csv")
>>
>>     x.show()
>>
>>   }
>>
>> }
>>
>>
>> hope this helps.
>>
>> Diego
>>
>> On 22 Mar 2017 7:18 pm, "Keith Chapman" <[email protected]> wrote:
>>
>> Hi,
>>
>> I'm trying to read in a CSV file into a Dataset but keep having
>> compilation issues. I'm using spark 2.1 and the following is a small
>> program that exhibit the issue I'm having. I've searched around but not
>> found a solution that worked, I've added "import sqlContext.implicits._" as
>> suggested but no luck. What am I missing? Would appreciate some advice.
>>
>> import org.apache.spark.sql.functions._
>> import org.apache.spark.{SparkContext, SparkConf}
>> import org.apache.spark.sql.{Encoder,Encoders}
>>
>> object DatasetTest{
>>
>>   def main(args: Array[String]) {
>>     val sparkConf = new SparkConf().setAppName("DatasetTest")
>>     val sc = new SparkContext(sparkConf)
>>     case class Foo(text: String)
>>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>     import sqlContext.implicits._
>>     val ds : org.apache.spark.sql.Dataset[Foo] =
>>       sqlContext.read.csv(args(1)).as[Foo]
>>     ds.show
>>   }
>> }
>>
>> Compiling the above program gives, I'd expect it to work as its a simple
>> case class, changing it to as[String] works, but I would like to get the
>> case class to work.
>>
>> [error] /home/keith/dataset/DataSetTest.scala:13: Unable to find encoder
>> for type stored in a Dataset. Primitive types (Int, String, etc) and
>> Product types (case classes) are supported by importing spark.implicits._
>> Support for serializing other types will be added in future releases.
>> [error] val ds : org.apache.spark.sql.Dataset[Foo] =
>> sqlContext.read.csv(args(1)).as[Foo]
>>
>>
>> Regards,
>> Keith.
