See the inline responses, hope they help.

On Fri, Apr 10, 2015 at 10:56 AM, Vinay Kesarwani <vnkesarw...@gmail.com>
wrote:

> Hi,
>
> I have the following scenario and need some help ASAP.
>
> 1. Ad hoc queries on Spark Streaming.
>    How can I run Spark queries on an ongoing streaming context?
>    Scenario: a streaming job finds the min and max value over the last
> 5 minutes (which I am able to do).
>    Now I want to run an interactive query to find the min and max over the
> last 30 minutes on this stream.
>    What I was thinking is to store the streaming RDDs as a SchemaRDD and
> query that. Is there a better approach?
>    Where should I store the SchemaRDD for near real-time performance?
>

Wouldn't window-based operations be sufficient for this?
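For example, a rough sketch with reduceByWindow (assuming a DStream[Double] called `values`; the durations are just illustrative):

import org.apache.spark.streaming.Minutes

// `values` is an assumed DStream[Double] built from your source.
// Track (min, max) over the last 30 minutes, recomputed every 5 minutes.
val minMax = values
  .map(v => (v, v))
  .reduceByWindow(
    (a, b) => (math.min(a._1, b._1), math.max(a._2, b._2)),
    Minutes(30),
    Minutes(5))

minMax.print()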


> 2. Saving and loading intermediate RDDs in cache/disk.
>    What is the best approach to do this? In case any worker fails,
> will a new worker resume the task and load these saved RDDs?
>

Enable checkpointing, and if you use the WAL (depending on your data source)
there will be no data loss. In case of worker node failures, any tasks
assigned to that worker that are yet to complete will be re-launched on
other machines.
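Roughly, the setup looks like this (the app name, batch interval and HDFS path are just placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingWithWAL")
  // Write received data to a write ahead log before it is processed.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// Checkpoint to a fault-tolerant file system; the HDFS path is illustrative.
ssc.checkpoint("hdfs:///sigmoid/checkpoints")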


> 3. Write ahead log and checkpointing.
>    What is the significance of the WAL and checkpointing? In case of
> checkpointing, if any worker fails will another worker load the checkpoint
> data and resume its job?
>

Yes, you just need to point the checkpoint directory at a fault-tolerant file
system (and maybe enable high availability/replication in your HDFS).
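On restart the driver can rebuild the context from that checkpoint, roughly like this (the names and paths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///sigmoid/checkpoints" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableStreaming")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build your DStream graph here ...
  ssc
}

// If a checkpoint exists, the context (and its pending batches) is recovered
// from it; otherwise createContext() is called to start fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()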


>    In what scenarios should I use the WAL and checkpointing?
> 4. Spawning multiple processes within Spark Streaming.
>    Doing multiple operations on the same stream.
>
>
Do you mean spawning multiple threads, or forking new processes? Either way,
it will be a headache controlling them. If you just want multiple operations
on the same stream, register multiple output operations on the same DStream,
as in the sketch below.
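Something like this works without managing threads yourself (`lines` is an assumed DStream[String]; the HDFS prefix is a placeholder):

import org.apache.spark.streaming.StreamingContext._

val words = lines.flatMap(_.split(" "))

// Operation 1: per-batch word counts, printed on the driver.
words.map(w => (w, 1)).reduceByKey(_ + _).print()

// Operation 2: persist the raw words to HDFS.
words.saveAsTextFiles("hdfs:///sigmoid/words")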


> 5. Accessing cached data between Spark components.
>    Is cached data in Spark Streaming accessible to Spark SQL? Can it be
> shared between these components, or between two SparkContexts?
>

Why not do something like myCachedStream.foreachRDD( PUT SparkSQL here! )?
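A rough sketch of that, assuming a DStream of a simple case class (the Reading type and table name are made up for illustration):

import org.apache.spark.sql.SQLContext

case class Reading(sensor: String, value: Double)

// `myCachedStream` is an assumed DStream[Reading].
myCachedStream.foreachRDD { rdd =>
  // In practice reuse one SQLContext per SparkContext rather than one per batch.
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.implicits._

  rdd.toDF().registerTempTable("readings")
  sqlContext
    .sql("SELECT sensor, MIN(value) AS min, MAX(value) AS max FROM readings GROUP BY sensor")
    .show()
}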


>    If yes, how? If not, is there an alternative approach?
> 6. Dynamic lookup data in Spark Streaming.
>    I have a scenario where I want to filter a stream using dynamic lookup
> data. How can I achieve this?
>    In case I get this lookup data as another stream and cache it, will it be
> possible to update/merge this data in the cache 24/7?
> What is the best approach to do this? I referred to the Twitter streaming
> example in Spark where it reads a spam file, but that file is not dynamic in
> nature.
>

Again, if your data changes often, it's better to do that inside a
foreachRDD on your DStream, something like:
myStream.foreachRDD { rdd =>
  // Re-read the lookup data on each batch so changes to the file are picked up.
  val file = ssc.sparkContext.textFile("/sigmoid/spam/")
  // Do whatever you want in here!
}
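One way to finish that sketch, assuming the stream carries plain strings and the spam file has one entry per line (both assumptions on my part; the output path is a placeholder):

myStream.foreachRDD { rdd =>
  // Collect the (small) lookup data to the driver and broadcast it; re-reading
  // the file every batch is what keeps the lookup dynamic.
  val spam = ssc.sparkContext.textFile("/sigmoid/spam/").collect().toSet
  val spamBc = rdd.sparkContext.broadcast(spam)

  // Keep only records that are not in the lookup set.
  val filtered = rdd.filter(record => !spamBc.value.contains(record))
  filtered.saveAsTextFile(s"/sigmoid/clean/${System.currentTimeMillis}")
}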
