Hi,

Or you could just use Structured Streaming:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
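
A minimal sketch of a Kafka source in Structured Streaming (the broker
address and topic name below are placeholders):

// requires the spark-sql-kafka-0-10 package on the classpath
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-stream").getOrCreate()

// read the topic as a streaming DataFrame; the Kafka payload arrives as bytes
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")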


Regards,
Gourav Sengupta

On Tue, Aug 14, 2018 at 10:51 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Aakash,
>
> In Spark Streaming, foreachRDD gives you access to the data in each
> micro-batch. You can transform that RDD into a DataFrame and implement
> the flow you describe.
>
> e.g.:
>
> var historyRDD: RDD[MyType] = sparkContext.emptyRDD
>
> // create Kafka DStream ...
>
> dstream.foreachRDD { rdd =>
>   val allData = historyRDD union rdd
>   val df = allData.toDF   // requires the RDD to be of some structured
>                           // type, i.e. a case class
>   // do something with the dataframe df
>   historyRDD = allData    // this needs checkpointing
> }
> Depending on the volume of data you're dealing with, it might not be
> possible to hold all of it in memory.
> Checkpointing the historyRDD is mandatory to break up the growing lineage
> (union keeps a reference to the previous RDDs, and at some point things
> will blow up).
> So, while this trick keeps the data within the Spark boundaries, you
> still need resilient storage to write the checkpoints to in order to
> implement a reliable streaming job.
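>
> A rough sketch of what the checkpointing could look like (the directory
> and the every-10-batches interval are just assumptions for illustration):
>
> sparkContext.setCheckpointDir("/some/resilient/path")  // e.g. HDFS or S3
>
> var batchCount = 0L
> dstream.foreachRDD { rdd =>
>   val allData = historyRDD union rdd
>   batchCount += 1
>   if (batchCount % 10 == 0) {
>     allData.checkpoint()  // truncates the lineage once materialized
>     allData.count()       // an action forces the checkpoint to be written
>   }
>   historyRDD = allData
> }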
>
> As you are using Kafka, another alternative would be to write the
> transformed data back to Kafka and have the training job consume that
> topic, replaying the data from the start.
> Confluent has some good resources on how to use "Kafka as storage".
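>
> A rough sketch of that flow (the broker address and topic name are
> placeholders, and it assumes the transformed data sits in a DataFrame
> called transformedDF):
>
> // streaming job: publish the transformed rows to a topic with long retention
> transformedDF.selectExpr("to_json(struct(*)) AS value")
>   .write
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")
>   .option("topic", "transformed-data")
>   .save()
>
> // training job: replay the whole topic as a static DataFrame
> val trainingDF = spark.read
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")
>   .option("subscribe", "transformed-data")
>   .option("startingOffsets", "earliest")
>   .load()
>   .selectExpr("CAST(value AS STRING) AS json")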
>
> I hope this helps.
>
> kr, Gerard.
>
> PS: I'm also not sure why you are initially writing the files to Kafka. It
> would be easier to read the files directly with Spark Streaming or
> Structured Streaming.
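>
> For example, a file source in Structured Streaming (the path, schema and
> CSV format below are assumptions for illustration) could look roughly like:
>
> import org.apache.spark.sql.types._
>
> val fileSchema = new StructType()       // file sources require an explicit schema
>   .add("feature1", DoubleType)
>   .add("feature2", DoubleType)
>
> val fileStream = spark.readStream
>   .schema(fileSchema)
>   .option("header", "true")
>   .csv("/path/to/incoming/files")       // picks up new files as they arrive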
>
>
>
>
>
> On Tue, Aug 14, 2018 at 9:31 AM Aakash Basu <aakash.spark....@gmail.com>
> wrote:
>
>> Hi all,
>>
>> The requirement is to process a file using Spark Streaming, fed from a
>> Kafka topic, and once all the transformations are done, turn it into a
>> static batch DataFrame and pass it into Spark ML model tuning.
>>
>> So far, I have been doing it in the following way (rough sketch after the
>> list) -
>>
>> 1) Read the file and publish it to a Kafka topic
>> 2) Consume it in Spark as a streaming DataFrame
>> 3) Run Spark transformation jobs on the streaming data
>> 4) Append and write it to HDFS
>> 5) Read the transformed data back as a batch in Spark
>> 6) Run the Spark ML model
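>>
>> A rough sketch of steps 2-5 (the broker address, topic name, paths and
>> the actual transformations are placeholders):
>>
>> val streamDF = spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "broker:9092")
>>   .option("subscribe", "file-topic")
>>   .load()
>>
>> val transformed = streamDF.selectExpr("CAST(value AS STRING) AS line")
>>   // ... further transformations ...
>>
>> transformed.writeStream
>>   .format("parquet")
>>   .option("path", "hdfs:///data/transformed")
>>   .option("checkpointLocation", "hdfs:///checkpoints/transformed")
>>   .outputMode("append")
>>   .start()
>>
>> val batchDF = spark.read.parquet("hdfs:///data/transformed")  // step 5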
>>
>> But the requirement is to avoid using HDFS, as it may not be installed
>> on certain clusters, so we have to avoid the disk I/O and, on the fly,
>> append the data coming from Kafka into a static Spark DataFrame and then
>> pass that DataFrame to the ML model.
>>
>> How to go about it?
>>
>> Thanks,
>> Aakash.
>>
>
