RE: Timed aggregation in Spark

2016-05-23 Thread Ewan Leith
Rather than opening a connection per record, if you do a DStream foreachRDD at the 
end of a 5-minute batch window

http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

then you can do an rdd.foreachPartition to iterate over the RDD's partitions. Open a 
connection to Vertica (or a pool of them) inside that foreachPartition, then do a 
partition.foreach to write each element from that partition to Vertica, before 
finally closing the pool of connections.
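As a rough sketch in Scala of that pattern (the `aggregated` DStream and the 
`ConnectionPool` / `conn.insert` helpers here are placeholders for whatever Vertica 
client you use, not a real API):

// `aggregated` is an assumed DStream[(String, Long)] produced by the batch-window
// aggregation; ConnectionPool is a hypothetical, lazily initialized pool helper.
aggregated.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection (or pooled connection) per partition, not per record.
    val conn = ConnectionPool.getConnection()        // hypothetical helper
    try {
      partition.foreach { case (key, count) =>
        conn.insert(key, count)                      // hypothetical write call
      }
    } finally {
      ConnectionPool.returnConnection(conn)          // return/close when the partition is done
    }
  }
}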

Hope this helps,
Ewan

From: Nikhil Goyal [mailto:nownik...@gmail.com]
Sent: 23 May 2016 21:55
To: Ofir Kerker <ofir.ker...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Timed aggregation in Spark

I don't think this solves the problem. Here are the issues:
1) How do we push the entire data to Vertica? Opening a connection per record 
would be too costly.
2) If a key doesn't come again, how do we push its data to Vertica?
3) How do we schedule the dumping of data so we avoid holding too much data in 
state?



On Mon, May 23, 2016 at 1:33 PM, Ofir Kerker 
<ofir.ker...@gmail.com> wrote:
Yes, check out mapWithState:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
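A rough sketch of how mapWithState with a state timeout could fit this use case 
(the key/value types, the 10-minute timeout, and the `pairs` DStream are 
assumptions for illustration):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Sums values per key; when a key times out (no new data for 10 minutes),
// its final aggregate is emitted downstream so it can be written to Vertica.
def updateFunc(key: String, value: Option[Long], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    // Key has been idle for the timeout period: emit its final sum for flushing.
    Some((key, state.get()))
  } else {
    state.update(value.getOrElse(0L) + state.getOption().getOrElse(0L))
    None
  }
}

val spec = StateSpec.function(updateFunc _).timeout(Minutes(10))

// `pairs` is an assumed DStream[(String, Long)]; keep only the emitted aggregates.
val flushed = pairs.mapWithState(spec).flatMap(opt => opt.toSeq)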

_
From: Nikhil Goyal <nownik...@gmail.com>
Sent: Monday, May 23, 2016 23:28
Subject: Timed aggregation in Spark
To: <user@spark.apache.org>


Hi all,

I want to aggregate my data for 5-10 minutes and then flush the aggregated data to 
some database like Vertica. updateStateByKey is not exactly helpful in this 
scenario, as I can't flush all the records at once, nor can I clear the 
state. I wanted to know if anyone else has faced a similar issue and how they 
handled it.

Thanks
Nikhil



