I don't know if this is the most efficient way to do that, but you can use
a sliding window that is bigger than your aggregation period and filter
only for the messages inside the period.

Remember that to work with the reduceByKeyAndWindow you need to associate
each row with the time key, in your case "yyyyMMddhhmm".

http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Hope it helps,

Regards,

Ricardo



On Sat, Jan 16, 2016 at 1:13 AM, ffarozan [via Apache Spark User List] <
ml-node+s1001560n25982...@n3.nabble.com> wrote:

> I am implementing aggregation using spark streaming and kafka. My batch
> and window size are same. And the aggregated data is persisted in
> Cassandra.
>
> I want to aggregate for fixed time windows - 5:00, 5:05, 5:10, ...
>
> But we cannot control when to run streaming job, we only get to specify
> the batch interval.
>
> So the problem is - lets say if streaming job starts at 5:02, then I will
> get results at 5:07, 5:12, etc. and not what I want.
>
> Any suggestions?
>
> thanks,
> Firdousi
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Fixed-time-aggregation-handling-driver-failures-tp25982.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=cmljYXJkby5wYWl2YUBjb3JwLmdsb2JvLmNvbXwxfDQ1MDcxMTc2Mw==>
> .
> NAML
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Fixed-time-aggregation-handling-driver-failures-tp25982p25990.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Reply via email to