PS: just to elaborate on my first sentence, the reason Spark (not streaming) can offer exactly-once semantics is that its update operation is idempotent. This is easy to do in a batch context because the input is finite, but it's harder in a streaming context.
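To make the idempotency point concrete, here is a minimal sketch in plain Python (no Spark involved). The function names and the event shape are made up for illustration. It only shows why "recompute and replace" survives a retry while "increment in place" does not.

```python
def recompute_total(events):
    """Idempotent: rebuilds the whole state from the (finite) input."""
    totals = {}
    for e in events:
        totals[e["key"]] = totals.get(e["key"], 0) + e["amount"]
    return totals


def apply_increment(state, event):
    """Not idempotent: adds the event's amount onto the existing state."""
    new_state = dict(state)
    new_state[event["key"]] = new_state.get(event["key"], 0) + event["amount"]
    return new_state


events = [{"key": "phone-1", "amount": 5}, {"key": "phone-1", "amount": 7}]

# Batch style: a retried job recomputes and overwrites, so the result is stable.
first_run = recompute_total(events)
retried_run = recompute_total(events)

# Streaming style with increments: replaying the last event after a failure
# counts it twice.
state = {}
for e in events:
    state = apply_increment(state, e)
replayed = apply_increment(state, events[-1])

print(first_run, retried_run, replayed)
```

Both batch runs give {'phone-1': 12}, while the replayed increment gives {'phone-1': 19}.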
On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji <[email protected]> wrote:

> So Spark (not streaming) does offer exactly once. Spark Streaming, however,
> can only do exactly-once semantics *if the update operation is idempotent*.
> updateStateByKey's update operation is idempotent, because it completely
> replaces the previous state.
>
> So as long as you use Spark Streaming, you must somehow make the update
> operation idempotent. Replacing the entire state is the easiest way to do
> it, but it's obviously expensive.
>
> The alternative is to do something similar to what Storm does. At that
> point, you'll have to ask though if just using Storm is easier than that.
>
> On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni <[email protected]> wrote:
>
>> As per my best understanding, Spark Streaming offers exactly-once
>> processing. Is this achieved only through updateStateByKey, or is there
>> another way to do the same?
>>
>> Ashish
>>
>> On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji <[email protected]> wrote:
>>
>>> In that case I assume you need exactly-once semantics. There's no
>>> out-of-the-box way to do that in Spark. There is updateStateByKey, but it's
>>> not practical with your use case as the state is too large (it'll try to
>>> dump the entire intermediate state on every checkpoint, which would be
>>> prohibitively expensive).
>>>
>>> So either you have to implement something yourself, or you can use Storm
>>> Trident (or the transactional low-level API).
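For "something similar to what Storm does", here is a rough sketch of the idea behind Trident-style transactional state: keep the id of the last applied batch next to the value, and skip the update if the same batch shows up again. This is plain Python with an in-memory dict standing in for the external store. The TxStore class and its method are hypothetical names, not a Spark or Storm API. It also assumes batches arrive in txid order and that a replayed batch has exactly the same contents as the first attempt.

```python
class TxStore:
    """In-memory stand-in for an external key-value store (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (last_applied_txid, value)

    def apply_batch(self, key, txid, delta):
        """Add `delta` to the value for `key`, at most once per txid."""
        last_txid, value = self._data.get(key, (None, 0))
        if last_txid == txid:
            # This batch was already applied; a replay must not count twice.
            return value
        value += delta
        self._data[key] = (txid, value)
        return value


store = TxStore()
store.apply_batch("phone-1", txid=1, delta=30)
store.apply_batch("phone-1", txid=2, delta=12)

# Batch 2 is replayed after a failure; the stored txid makes it a no-op.
after_replay = store.apply_batch("phone-1", txid=2, delta=12)
print(after_replay)
```

The total stays at 42 after the replay. In a real store the txid and the value have to be written atomically, which is the part that takes actual work.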
>>>
>>> On Wed, Jun 17, 2015 at 1:26 PM, Ashish Soni <[email protected]> wrote:
>>>
>>>> My use case is below.
>>>>
>>>> We are going to receive a lot of events as a stream (basically a Kafka
>>>> stream) and then we need to process and compute.
>>>>
>>>> Consider you have a phone contract with ATT and every call / SMS / data
>>>> usage you do is an event, and then it needs to calculate your bill on a
>>>> real-time basis, so when you log in to your account you can see all those
>>>> variables: how much you used, how much is left, and what your bill is to
>>>> date. Also, there are different rules which need to be considered when you
>>>> calculate the total bill. One simple rule would be: 0-500 min is free, but
>>>> above that it is $1 a min.
>>>>
>>>> How do I maintain a shared state (total amount, total min, total
>>>> data etc.) so that I know how much I accumulated at any given point, as
>>>> events for the same phone can go to any node / executor?
>>>>
>>>> Can someone please tell me how I can achieve this in Spark, as in Storm
>>>> I can have a bolt which can do this?
>>>>
>>>> Thanks,
>>>>
>>>> On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji <[email protected]> wrote:
>>>>
>>>>> I guess both. In terms of syntax, I was comparing it with Trident.
>>>>>
>>>>> If you are joining, Spark Streaming actually does offer windowed join
>>>>> out of the box. We couldn't use this though as our event stream can grow
>>>>> "out-of-sync", so we had to implement something on top of Storm. If your
>>>>> event streams don't become out of sync, you may find the built-in join in
>>>>> Spark Streaming useful. Storm also has a join keyword but its semantics
>>>>> are different.
>>>>>
>>>>> > Also, what do you mean by "No Back Pressure" ?
>>>>>
>>>>> So when a topology is overloaded, Storm is designed so that it will
>>>>> stop reading from the source. Spark, on the other hand, will keep reading
>>>>> from the source and spilling it internally.
>>>>> This may be fine, in fairness, but it does mean you have to worry about
>>>>> the persistent store usage in the processing cluster, whereas with Storm
>>>>> you don't have to worry because the messages just remain in the data store.
>>>>>
>>>>> Spark came up with the idea of rate limiting, but I don't feel this is
>>>>> as nice as back pressure because it's very difficult to tune it such that
>>>>> you don't cap the cluster's processing power but yet so that it will
>>>>> prevent the persistent storage from getting used up.
>>>>>
>>>>> On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast <[email protected]> wrote:
>>>>>
>>>>>> When you say Storm, did you mean Storm with Trident or Storm?
>>>>>>
>>>>>> My use case does not have simple transformation. There are complex
>>>>>> events that need to be generated by joining the incoming event stream.
>>>>>>
>>>>>> Also, what do you mean by "No Back Pressure"?
>>>>>>
>>>>>> On Wednesday, 17 June 2015 11:57 AM, Enno Shioji <[email protected]> wrote:
>>>>>>
>>>>>> We've evaluated Spark Streaming vs. Storm and ended up sticking with
>>>>>> Storm.
>>>>>>
>>>>>> Some of the important drawbacks are:
>>>>>> Spark has no back pressure (receiver rate limit can alleviate this to
>>>>>> a certain point, but it's far from ideal)
>>>>>> There is also no exactly-once semantics. (updateStateByKey can
>>>>>> achieve this semantics, but is not practical if you have any significant
>>>>>> amount of state because it does so by dumping the entire state on every
>>>>>> checkpoint)
>>>>>>
>>>>>> There are also some minor drawbacks that I'm sure will be fixed
>>>>>> quickly, like no task timeout, not being able to read from Kafka using
>>>>>> multiple nodes, data loss hazard with Kafka.
>>>>>>
>>>>>> It's also not possible to attain very low latency in Spark, if that's
>>>>>> what you need.
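On the back pressure point, a toy model may help show where the backlog ends up. This is a plain Python simulation, not Spark or Storm code. The simulate function, its parameters, and the numbers are all invented for illustration. With no limit the reader pulls everything and the backlog piles up inside the processing cluster. With a bounded buffer the reader only pulls what fits, so the backlog stays in the source.

```python
def simulate(arrivals_per_tick, capacity_per_tick, ticks, max_buffer=None):
    """Return (buffered_in_cluster, left_at_source) after `ticks` steps.

    max_buffer=None models a reader that never stops pulling from the source.
    A number models back pressure: the reader pulls only what fits.
    """
    buffered = 0   # messages held inside the processing cluster
    at_source = 0  # messages still sitting in the source (e.g. Kafka)
    for _ in range(ticks):
        at_source += arrivals_per_tick
        if max_buffer is None:
            room = at_source
        else:
            room = max(0, max_buffer - buffered)
        pulled = min(at_source, room)
        at_source -= pulled
        buffered += pulled
        # The cluster processes up to its capacity each tick.
        buffered -= min(buffered, capacity_per_tick)
    return buffered, at_source


no_back_pressure = simulate(100, 60, ticks=10)
with_back_pressure = simulate(100, 60, ticks=10, max_buffer=60)
print(no_back_pressure, with_back_pressure)
```

In both runs the system is 400 messages behind after 10 ticks. Without back pressure they sit in the cluster, (400, 0). With it they remain at the source, (0, 400).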
>>>>>>
>>>>>> The pro for Spark is the concise and IMO more intuitive syntax,
>>>>>> especially if you compare it with Storm's Java API.
>>>>>>
>>>>>> I admit I might be a bit biased towards Storm tho as I'm more
>>>>>> familiar with it.
>>>>>>
>>>>>> Also, you can do some processing with Kinesis. If all you need to do
>>>>>> is straightforward transformation and you are reading from Kinesis to
>>>>>> begin with, it might be an easier option to just do the transformation in
>>>>>> Kinesis.
>>>>>>
>>>>>> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan <[email protected]> wrote:
>>>>>>
>>>>>> Whatever you write in bolts would be the logic you want to apply on
>>>>>> your events. In Spark, that logic would be coded in map() or similar such
>>>>>> transformations and/or actions. Spark doesn't enforce a structure for
>>>>>> capturing your processing logic like Storm does.
>>>>>> Regards
>>>>>> Sab
>>>>>>
>>>>>> Probably overloading the question a bit.
>>>>>>
>>>>>> In Storm, Bolts have the functionality of getting triggered on
>>>>>> events. Is that kind of functionality possible with Spark Streaming?
>>>>>> During each phase of the data processing, the transformed data is stored
>>>>>> to the database and this transformed data should then be sent to a new
>>>>>> pipeline for further processing.
>>>>>>
>>>>>> How can this be achieved using Spark?
>>>>>>
>>>>>> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast <[email protected]> wrote:
>>>>>>
>>>>>> I have a use case where a stream of incoming events has to be
>>>>>> aggregated and joined to create complex events. The aggregation will have
>>>>>> to happen at an interval of 1 minute (or less).
>>>>>>
>>>>>> The pipeline is:
>>>>>> send events
>>>>>> enrich event
>>>>>> Upstream services -------------------> KAFKA ---------> event Stream
>>>>>> Processor ------------> Complex Event Processor ------------> Elastic
>>>>>> Search.
>>>>>>
>>>>>> From what I understand, Storm will make a very good ESP and Spark
>>>>>> Streaming will make a good CEP.
>>>>>>
>>>>>> But, we are also evaluating Storm with Trident.
>>>>>>
>>>>>> How does Spark Streaming compare with Storm with Trident?
>>>>>>
>>>>>> Sridhar Chellappa
>>>>>>
>>>>>> On Wednesday, 17 June 2015 10:02 AM, ayan guha <[email protected]> wrote:
>>>>>>
>>>>>> I have a similar scenario where we need to bring data from Kinesis to
>>>>>> HBase. Data velocity is 20k per 10 mins. Little manipulation of data will
>>>>>> be required but that's regardless of the tool so we will be writing that
>>>>>> piece in Java POJO.
>>>>>> All env is on AWS. HBase is on a long-running EMR and Kinesis on a
>>>>>> separate cluster.
>>>>>> TIA.
>>>>>> Best
>>>>>> Ayan
>>>>>>
>>>>>> On 17 Jun 2015 12:13, "Will Briggs" <[email protected]> wrote:
>>>>>>
>>>>>> The programming models for the two frameworks are conceptually rather
>>>>>> different; I haven't worked with Storm for quite some time, but based on
>>>>>> my old experience with it, I would equate Spark Streaming more with
>>>>>> Storm's Trident API, rather than with the raw Bolt API. Even then, there
>>>>>> are significant differences, but it's a bit closer.
>>>>>>
>>>>>> If you can share your use case, we might be able to provide better
>>>>>> guidance.
>>>>>>
>>>>>> Regards,
>>>>>> Will
>>>>>>
>>>>>> On June 16, 2015, at 9:46 PM, [email protected] wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am evaluating Spark vs. Storm (Spark Streaming) and I am not able
>>>>>> to see what the equivalent of Storm's Bolt is in Spark.
>>>>>>
>>>>>> Any help on this will be appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Ashish
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
