Hi Lian,

This could be the solution:
case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

// symbolDs is Dataset[Symbol]; pullSymbolFromYahoo returns Dataset[Tick]
symbolDs.map { k =>
  pullSymbolFromYahoo(k.symbol, k.sector)
}(org.apache.spark.sql.Encoders.kryo[Tick])

Regards,
Snehasish

On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau <holden.ka...@gmail.com> wrote:
> I'm not sure what you mean by it could be hard to serialize complex
> operations?
>
> Regardless, I think the question is: do you want to parallelize this on
> multiple machines or just one?
>
> On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2...@gmail.com> wrote:
>
>> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However,
>> it could be hard to serialize complex operations for Spark to execute in
>> parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.
>>
>> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> ** You do NOT need dataframes, I mean.....
>>>
>>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> A couple of suggestions:
>>>>
>>>> 1. Do not use Dataset; use DataFrame in this scenario. There is no
>>>> benefit from Dataset features here. With a DataFrame, you can write an
>>>> arbitrary UDF that does what you want.
>>>> 2. In fact you do not need DataFrames here. You would be better off with
>>>> an RDD: just create an RDD of symbols and use map to do the processing.
>>>>
>>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com>
>>>> wrote:
>>>>
>>>>> Do you only want to use Scala? Because otherwise, I think with PySpark
>>>>> and pandas read_table you should be able to accomplish what you want to
>>>>> accomplish.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Irving Duran
>>>>>
>>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a use case:
>>>>>
>>>>> I want to download S&P 500 stock data from the Yahoo API in parallel
>>>>> using Spark. I have got all stock symbols as a Dataset. Then I used the
>>>>> code below to call the Yahoo API for each symbol:
>>>>>
>>>>> case class Symbol(symbol: String, sector: String)
>>>>>
>>>>> case class Tick(symbol: String, sector: String, open: Double, close:
>>>>> Double)
>>>>>
>>>>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns
>>>>> Dataset[Tick]
>>>>>
>>>>> symbolDs.map { k =>
>>>>>   pullSymbolFromYahoo(k.symbol, k.sector)
>>>>> }
>>>>>
>>>>> This statement cannot compile:
>>>>>
>>>>> Unable to find encoder for type stored in a Dataset. Primitive types
>>>>> (Int, String, etc) and Product types (case classes) are supported by
>>>>> importing spark.implicits._ Support for serializing other types will
>>>>> be added in future releases.
>>>>>
>>>>> My questions are:
>>>>>
>>>>> 1. As you can see, this scenario is not traditional dataset handling
>>>>> such as count, SQL query, etc. Instead, it is more like a UDF that
>>>>> applies an arbitrary operation to each record. Is Spark good at
>>>>> handling such a scenario?
>>>>>
>>>>> 2. Regarding the compilation error, is there any fix? I did not find a
>>>>> satisfactory solution online.
>>>>>
>>>>> Thanks for the help!
>>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
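[Editor's note] The suggestions in this thread can be combined into a minimal self-contained sketch. The underlying problem is not only the missing encoder: `pullSymbolFromYahoo` returns a `Dataset[Tick]` inside `map`, and Datasets cannot be nested or created on executors. The sketch below therefore assumes a hypothetical helper, `pullTicksFromYahoo`, that performs the HTTP call and returns plain case class instances; with that change, the implicit Product encoder for `Tick` suffices and no Kryo encoder is needed. Everything except the `Symbol`/`Tick` case classes and the `symbolDs` name is an assumption, not code from the thread:

```scala
import org.apache.spark.sql.SparkSession

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

// Hypothetical per-symbol fetch. It must return ordinary Scala objects,
// not a Dataset, because it runs on executors where no SparkSession exists.
def pullTicksFromYahoo(symbol: String, sector: String): Seq[Tick] = {
  // ... perform the HTTP call to the Yahoo API here ...
  Seq.empty[Tick]
}

object YahooPull {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("yahoo-pull").getOrCreate()
    import spark.implicits._  // brings in the Product encoder for Tick

    val symbolDs = Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy")).toDS()

    // flatMap each symbol to its ticks; fetches run in parallel on executors,
    // and the case-class encoder from spark.implicits._ handles Tick.
    val ticks = symbolDs.flatMap(s => pullTicksFromYahoo(s.symbol, s.sector))

    ticks.show()
    spark.stop()
  }
}
```

The equivalent RDD form suggested earlier in the thread would be `symbolDs.rdd.flatMap(...)` followed by `.toDF()` if a DataFrame is needed afterwards; either way the per-symbol work is distributed across the cluster.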