[jira] [Commented] (SPARK-28120) RocksDB state storage
[ https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133112#comment-17133112 ]

Vikram Agrawal commented on SPARK-28120:

The implementation is available here (https://github.com/qubole/spark-state-store). I have also published it to Maven; it can be downloaded from here (https://mvnrepository.com/artifact/com.qubole.spark/spark-rocksdb-state-store).

> RocksDB state storage
> ---------------------
>
> Key: SPARK-28120
> URL: https://issues.apache.org/jira/browse/SPARK-28120
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Vikram Agrawal
> Priority: Major
>
> SPARK-13809 introduced a framework for state management for computing
> streaming aggregates. The default implementation was an in-memory hash map
> that was backed up to an HDFS-compliant file system at the end of every
> micro-batch.
> The current implementation suffers from performance and latency issues. It
> uses executor JVM memory to store state, so the state store size is limited
> by the size of the executor memory. Executor JVM memory is also shared
> between state storage and other task operations, so the state storage size
> affects the performance of task execution.
> Moreover, GC pauses, executor failures, and OOM issues become common as the
> state storage grows, which increases the overall latency of a micro-batch.
> RocksDB is an embedded database that can provide major performance
> improvements. Other major streaming frameworks use RocksDB as their default
> state storage.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28120) RocksDB state storage
[ https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram Agrawal resolved SPARK-28120.

Resolution: Later

The implementation will be submitted to https://spark-packages.org.

> RocksDB state storage
> ---------------------
>
> Key: SPARK-28120
> URL: https://issues.apache.org/jira/browse/SPARK-28120
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Vikram Agrawal
> Priority: Major
[jira] [Commented] (SPARK-28120) RocksDB state storage
[ https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869134#comment-16869134 ]

Vikram Agrawal commented on SPARK-28120:

I have raised a PR for it - https://github.com/apache/spark/pull/24922

> RocksDB state storage
> ---------------------
>
> Key: SPARK-28120
> URL: https://issues.apache.org/jira/browse/SPARK-28120
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 2.4.3
> Reporter: Vikram Agrawal
> Priority: Major
[jira] [Commented] (SPARK-13809) State Store: A new framework for state management for computing Streaming Aggregates
[ https://issues.apache.org/jira/browse/SPARK-13809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868403#comment-16868403 ]

Vikram Agrawal commented on SPARK-13809:

[~skonto] I have raised SPARK-28120 for the RocksDB implementation and opened a [PR|https://github.com/apache/spark/pull/24922] for it.

> State Store: A new framework for state management for computing Streaming Aggregates
> -------------------------------------------------------------------------------------
>
> Key: SPARK-13809
> URL: https://issues.apache.org/jira/browse/SPARK-13809
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Priority: Major
[jira] [Created] (SPARK-28120) RocksDB state storage
Vikram Agrawal created SPARK-28120:

Summary: RocksDB state storage
Key: SPARK-28120
URL: https://issues.apache.org/jira/browse/SPARK-28120
Project: Spark
Issue Type: New Feature
Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Vikram Agrawal

SPARK-13809 introduced a framework for state management for computing streaming aggregates. The default implementation was an in-memory hash map that was backed up to an HDFS-compliant file system at the end of every micro-batch.

The current implementation suffers from performance and latency issues. It uses executor JVM memory to store state, so the state store size is limited by the size of the executor memory. Executor JVM memory is also shared between state storage and other task operations, so the state storage size affects the performance of task execution.

Moreover, GC pauses, executor failures, and OOM issues become common as the state storage grows, which increases the overall latency of a micro-batch.

RocksDB is an embedded database that can provide major performance improvements. Other major streaming frameworks use RocksDB as their default state storage.
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711319#comment-16711319 ]

Vikram Agrawal commented on SPARK-18165:

Hi [~danielil] - right now it is available at https://github.com/qubole/kinesis-sql

> Kinesis support in Structured Streaming
> ---------------------------------------
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Lauren Moos
> Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696303#comment-16696303 ]

Vikram Agrawal commented on SPARK-18165:

[~piyush9194] - The library is already available for both 2.3 and 2.4.

> Kinesis support in Structured Streaming
> ---------------------------------------
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Lauren Moos
> Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494732#comment-16494732 ]

Vikram Agrawal commented on SPARK-18165:

[~mail2sivan...@gmail.com] - This library has been developed and tested against Spark 2.2.x. I understand that you are trying it against Spark 2.3.0. Could you please raise an issue in the kinesis-sql repo (https://github.com/qubole/kinesis-sql) so that we can continue the discussion there?

> Kinesis support in Structured Streaming
> ---------------------------------------
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Lauren Moos
> Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466897#comment-16466897 ]

Vikram Agrawal commented on SPARK-18165:

Thanks [~marmbrus] - I am planning to start porting the connector in the next few weeks. I will share my feedback or ask for help once I am ready.

Thanks for your suggestion. I will check out Apache Bahir/Spark Packages and start a PR once I have ported my changes to the DataSourceV2 APIs.

> Kinesis support in Structured Streaming
> ---------------------------------------
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Lauren Moos
> Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391557#comment-16391557 ]

Vikram Agrawal commented on SPARK-18165:

[~gaurav24] - Yes, I saw that. Nonetheless, I have spent enough time going through the available Kinesis APIs and the Structured Streaming source provider requirements to come up with this library. Please give it a try and share your feedback and suggestions.

> Kinesis support in Structured Streaming
> ---------------------------------------
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
> Issue Type: New Feature
> Components: DStreams
> Reporter: Lauren Moos
> Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391305#comment-16391305 ]

Vikram Agrawal commented on SPARK-18165:

I have worked on an implementation of Kinesis integration as a source for Structured Streaming. It is available here: https://github.com/qubole/kinesis-sql. Please try it out. I would be happy to discuss the design details and work on any concerns. If the implementation is acceptable and there is enough interest, I will start a PR for it.

> Kinesis support in Structured Streaming
> ---------------------------------------
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
> Issue Type: New Feature
> Components: DStreams
> Reporter: Lauren Moos
> Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming