Hi Lian,

This could be the solution: give map an explicit Encoder for Tick, for
example a Kryo encoder:


case class Symbol(symbol: String, sector: String)

case class Tick(symbol: String, sector: String, open: Double, close: Double)


// symbolDs is a Dataset[Symbol]; pullSymbolFromYahoo returns a Dataset[Tick]


    symbolDs.map { k =>
      pullSymbolFromYahoo(k.symbol, k.sector)
    }(org.apache.spark.sql.Encoders.kryo[Tick])
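
Note that this assumes pullSymbolFromYahoo returns a plain Tick. If it
really returns a Dataset[Tick], map cannot work at all, because Datasets
cannot be nested inside other Dataset operations. A minimal sketch of an
alternative, assuming the function can be changed to return a Seq[Tick]
(pullTicksFromYahoo below is a hypothetical name):

    import org.apache.spark.sql.{Dataset, Encoders}

    // Hypothetical variant returning plain rows instead of a Dataset,
    // so each task can fetch one symbol's ticks and emit them directly.
    def pullTicksFromYahoo(symbol: String, sector: String): Seq[Tick] = ???

    // flatMap flattens the per-symbol Seq[Tick] into one Dataset[Tick];
    // the explicit Kryo encoder tells Spark how to serialize Tick.
    val ticks: Dataset[Tick] =
      symbolDs.flatMap { k =>
        pullTicksFromYahoo(k.symbol, k.sector)
      }(Encoders.kryo[Tick])

Since Tick is a case class, importing spark.implicits._ would also bring
a regular Product encoder into scope, instead of Kryo.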


Thanks,

Snehasish



On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau <holden.ka...@gmail.com>
wrote:

> I'm not sure what you mean by it could be hard to serialize complex
> operations?
>
> Regardless I think the question is do you want to parallelize this on
> multiple machines or just one?
>
> On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2...@gmail.com> wrote:
>
>> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However,
>> it could be hard to serialize complex operations for Spark to execute in
>> parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.
>>
>> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> ** You do NOT need dataframes, I mean.....
>>>
>>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Couple of suggestions:
>>>>
>>>> 1. Do not use Dataset; use DataFrame in this scenario. There is no
>>>> benefit from Dataset features here. Using DataFrame, you can write an
>>>> arbitrary UDF which can do what you want to do.
>>>> 2. In fact, you do not need DataFrames here. You would be better off
>>>> with an RDD: just create an RDD of symbols and use map to do the
>>>> processing, as in the sketch below.
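>>>>
>>>> A minimal sketch of the RDD approach (fetchTicks is a hypothetical
>>>> per-symbol fetch that returns plain Tick rows, and symbols is assumed
>>>> to be a local Seq[Symbol]):
>>>>
>>>>     import org.apache.spark.rdd.RDD
>>>>     import org.apache.spark.sql.SparkSession
>>>>
>>>>     val spark = SparkSession.builder.getOrCreate()
>>>>
>>>>     // Hypothetical per-symbol fetch; any serializable function works.
>>>>     def fetchTicks(symbol: String, sector: String): Seq[Tick] = ???
>>>>
>>>>     // Distribute the symbols across the cluster, then fetch each
>>>>     // symbol's ticks on the executors; flatMap flattens the
>>>>     // per-symbol sequences into one RDD[Tick].
>>>>     val ticks: RDD[Tick] = spark.sparkContext
>>>>       .parallelize(symbols)
>>>>       .flatMap(s => fetchTicks(s.symbol, s.sector))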
>>>>
>>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com>
>>>> wrote:
>>>>
>>>>> Do you only want to use Scala? Because otherwise, I think with pyspark
>>>>> and pandas read table you should be able to accomplish what you want.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Irving Duran
>>>>>
>>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a use case:
>>>>>
>>>>> I want to download S&P 500 stock data from the Yahoo API in parallel
>>>>> using Spark. I have got all the stock symbols as a Dataset. Then I
>>>>> used the code below to call the Yahoo API for each symbol:
>>>>>
>>>>>
>>>>>
>>>>> case class Symbol(symbol: String, sector: String)
>>>>>
>>>>> case class Tick(symbol: String, sector: String, open: Double, close:
>>>>> Double)
>>>>>
>>>>>
>>>>> // symbolDs is a Dataset[Symbol]; pullSymbolFromYahoo returns a
>>>>> // Dataset[Tick]
>>>>>
>>>>>
>>>>>     symbolDs.map { k =>
>>>>>
>>>>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>>>>
>>>>>     }
>>>>>
>>>>>
>>>>> This statement does not compile:
>>>>>
>>>>>
>>>>> Unable to find encoder for type stored in a Dataset.  Primitive types
>>>>> (Int, String, etc) and Product types (case classes) are supported by
>>>>> importing spark.implicits._  Support for serializing other types will
>>>>> be added in future releases.
>>>>>
>>>>>
>>>>> My questions are:
>>>>>
>>>>>
>>>>> 1. As you can see, this scenario is not traditional dataset handling
>>>>> such as count, SQL query, etc. Instead, it is more like a UDF which
>>>>> applies an arbitrary operation to each record. Is Spark good at
>>>>> handling such a scenario?
>>>>>
>>>>>
>>>>> 2. Regarding the compilation error, any fix? I did not find a
>>>>> satisfactory solution online.
>>>>>
>>>>>
>>>>> Thanks for the help!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
