Hi Vijay,

Should HTTPConnection() (or any other object created per partition) be
serializable for your code to work? If so, the usage seems limited.

Sometimes the error caused by a non-serializable object can be very
misleading (e.g. "Return statements aren't allowed in Spark closures"
instead of "Task not serializable").

The post shared by Anastasios helps but does not completely resolve the
serialization problem. For example, if I need to create a per-partition
object that relies on other objects which may not be serializable, then
wrapping the object creation in an object (making it a static function)
does not help; moreover, the programming model becomes unintuitive.

I have been playing with this scenario for some time and am still frustrated. Thanks

On Tue, Feb 20, 2018 at 12:47 AM, vijay.bvp <bvpsa...@gmail.com> wrote:

> I am assuming pullSymbolFromYahoo opens a connection to the Yahoo API
> with some token passed. In the code provided so far, if you have 2000
> symbols, it will make 2000 new connections and 2000 API calls!
> Connection objects can't/shouldn't be serialized and sent to executors;
> they should rather be created at the executors.
> The philosophy below is nicely documented in the Spark Streaming guide;
> look at "Design Patterns for using foreachRDD":
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> // assume symbolRdd is an RDD of Symbol; a Dataset/DataFrame can be
> // converted to an RDD
> symbolRdd.foreachPartition(partition => {
>   // this code runs at the executor
>   // open one connection per partition here
>   val connectionToYahoo = new HTTPConnection()
>   partition.foreach(k =>
>     pullSymbolFromYahoo(k.symbol, k.sector, connectionToYahoo)
>   )
> })
> With the above code, if the dataset has 10 partitions (2000 symbols),
> only 10 connections will be opened, though it still makes 2000 API calls.
> You should also look at how you send and receive results for a large
> number of symbols: because of the amount of parallelism that Spark
> provides, you might run into rate limits on the APIs. If you are sending
> symbols in bulk, the above pattern is also very useful.
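A per-partition rate limiter is one way to stay under such API limits. The sketch below is a minimal illustration (Throttler and maxPerSecond are assumptions, not from this thread); it would be instantiated once inside foreachPartition, alongside the connection:

```scala
// Minimal sketch: spaces successive calls so a single partition stays
// under maxPerSecond requests per second. Not thread-safe across
// partitions; each partition gets its own instance.
class Throttler(maxPerSecond: Int) {
  private val minIntervalMs = 1000L / maxPerSecond
  private var lastCall = 0L

  def throttle[T](f: => T): T = {
    val now  = System.currentTimeMillis()
    val wait = lastCall + minIntervalMs - now
    if (wait > 0) Thread.sleep(wait)
    lastCall = System.currentTimeMillis()
    f // run the actual API call and return its result
  }
}
```

Note the caveat: with N partitions running in parallel, the aggregate rate is up to N * maxPerSecond, so the per-partition budget has to be divided by the number of concurrent partitions to respect a global limit.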
> thanks
> Vijay
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
