Roshan Naik created STORM-2308:
----------------------------------

             Summary: Support for Non-replayable Sources
                 Key: STORM-2308
                 URL: https://issues.apache.org/jira/browse/STORM-2308
             Project: Apache Storm
          Issue Type: Sub-task
          Components: storm-core
    Affects Versions: 2.0.0
            Reporter: Roshan Naik


In order to recover from failures without data loss, Storm (and other streaming 
systems) places the responsibility of buffering events on the source system. In 
the event of a crash or other failure, in-flight events can be re-fetched from 
the source and their processing can be retried on recovery. A nice benefit of 
this approach is that it keeps Storm’s architecture simple. 

While it is desirable to avoid the complexities of creating an internal 
reliable buffering system, it is not necessary to restrict Spouts to accepting 
data only from persistent sources such as Kafka, HDFS or databases. Some amount 
of data loss is acceptable in many use cases. Storm already supports such use 
cases by allowing ACK-ing to be disabled. 
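
As a rough illustration of the trade-off described above, the following 
self-contained toy model contrasts a source that buffers events until they are 
acknowledged with the same source used with acknowledgements effectively 
disabled. It is only a sketch: ReplaySketch, ReplayableSource and run are 
hypothetical names invented for this example and are not Storm APIs, and the 
model ignores Storm's actual tuple tracking.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model only: none of these names are Storm APIs. */
public class ReplaySketch {

    /** Hypothetical source that keeps each event until it is acked. */
    static class ReplayableSource {
        private final Queue<String> pending = new ArrayDeque<>();
        private final Map<Integer, String> inFlight = new HashMap<>();
        private int nextId = 0;

        ReplayableSource(List<String> events) {
            pending.addAll(events);
        }

        /** Hands out the next event id, or -1 when nothing is pending. */
        int fetch() {
            String event = pending.poll();
            if (event == null) {
                return -1;
            }
            inFlight.put(nextId, event);
            return nextId++;
        }

        String payload(int id) {
            return inFlight.get(id);
        }

        /** The event is done; the source may forget it. */
        void ack(int id) {
            inFlight.remove(id);
        }

        /** The event failed; queue it again so it can be re-fetched. */
        void fail(int id) {
            String event = inFlight.remove(id);
            if (event != null) {
                pending.add(event);
            }
        }
    }

    /**
     * Processes all events, simulating one failure the first time the
     * payload failOnce is seen. With ackingEnabled the failed event is
     * replayed; without it the event is dropped.
     */
    public static List<String> run(List<String> events, String failOnce,
                                   boolean ackingEnabled) {
        ReplayableSource source = new ReplayableSource(events);
        List<String> processed = new ArrayList<>();
        boolean alreadyFailed = false;
        int id;
        while ((id = source.fetch()) >= 0) {
            String event = source.payload(id);
            if (!alreadyFailed && event.equals(failOnce)) {
                alreadyFailed = true;
                if (ackingEnabled) {
                    source.fail(id);   // source re-queues the event
                } else {
                    source.ack(id);    // nobody tracks the failure: event lost
                }
                continue;
            }
            processed.add(event);
            source.ack(id);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "b", "c");
        System.out.println(run(events, "b", true));   // prints [a, c, b]
        System.out.println(run(events, "b", false));  // prints [a, c]
    }
}
```

In the first run the failed event is re-fetched and processed later, at the 
cost of the source having to buffer it. In the second run the event is simply 
lost, which is the behaviour a non-replayable source would have to accept. 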

Users who can tolerate data loss benefit from having spouts that can accept 
data directly from a wider variety of sources such as HTTP, TCP/UDP, Syslog, 
Flume etc. For such use cases, not forcing all data through a system like 
Kafka improves end-to-end latency, simplifies management and reduces the cost 
of the data pipeline. Users who care about not losing data can always funnel 
the incoming data via Kafka or another persistent store and enable ACKs.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
