[jira] [Commented] (SPARK-28120) RocksDB state storage

2020-06-11 Thread Vikram Agrawal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133112#comment-17133112
 ] 

Vikram Agrawal commented on SPARK-28120:


The implementation is available at 
https://github.com/qubole/spark-state-store. I have published it to Maven; it 
can be downloaded from 
https://mvnrepository.com/artifact/com.qubole.spark/spark-rocksdb-state-store
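
For readers who want to try the package, a minimal sketch of wiring a custom 
state store provider into a job might look like the following. The setting 
spark.sql.streaming.stateStore.providerClass is the Spark SQL option that 
selects the provider. The provider class name, the package coordinate 
placeholder, and the helper name spark_submit_args are assumptions made for 
illustration and should be checked against the repository README.

```python
from typing import List

# Spark SQL setting that selects the StateStoreProvider implementation.
PROVIDER_CONF_KEY = "spark.sql.streaming.stateStore.providerClass"

# Assumption: provider class name; verify it against the project README.
ASSUMED_PROVIDER_CLASS = (
    "org.apache.spark.sql.execution.streaming.state.RocksDbStateStoreProvider"
)


def spark_submit_args(package: str, app: str,
                      provider_class: str = ASSUMED_PROVIDER_CLASS) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper).

    The list pulls the given package and points Spark at the provider class.
    """
    return [
        "spark-submit",
        "--packages", package,
        "--conf", f"{PROVIDER_CONF_KEY}={provider_class}",
        app,
    ]


if __name__ == "__main__":
    # Placeholder coordinate: substitute the real Scala suffix and version.
    args = spark_submit_args(
        "com.qubole.spark:spark-rocksdb-state-store_<scala>:<version>",
        "my_streaming_job.py",
    )
    print(" ".join(args))
```

Passing the provider as a job-level setting keeps the application code 
unchanged, so switching back to the default provider only requires dropping 
the --conf flag.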

> RocksDB state storage
> -
>
> Key: SPARK-28120
> URL: https://issues.apache.org/jira/browse/SPARK-28120
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Vikram Agrawal
>Priority: Major
>
> SPARK-13809 introduced a framework for state management for computing 
> streaming aggregates. The default implementation is an in-memory hashmap that 
> is backed up to an HDFS-compliant file system at the end of every micro-batch. 
> The current implementation suffers from performance and latency issues. It uses 
> executor JVM memory to store the state, so the state store size is limited by 
> the size of the executor memory. Executor JVM memory is also shared between 
> state storage and other task operations, so the state storage size affects 
> the performance of task execution.
> Moreover, GC pauses, executor failures, and OOM issues become common as the 
> state storage grows, which increases the overall latency of a micro-batch.
> RocksDB is an embedded database that can provide major performance improvements. 
> Other major streaming frameworks use RocksDB as their default state storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28120) RocksDB state storage

2019-10-18 Thread Vikram Agrawal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikram Agrawal resolved SPARK-28120.

Resolution: Later

The implementation will be submitted to https://spark-packages.org. 







[jira] [Commented] (SPARK-28120) RocksDB state storage

2019-06-20 Thread Vikram Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869134#comment-16869134
 ] 

Vikram Agrawal commented on SPARK-28120:


I have raised a PR for it - https://github.com/apache/spark/pull/24922

> RocksDB state storage
> -
>
> Key: SPARK-28120
> URL: https://issues.apache.org/jira/browse/SPARK-28120
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Vikram Agrawal
>Priority: Major
>






[jira] [Commented] (SPARK-13809) State Store: A new framework for state management for computing Streaming Aggregates

2019-06-20 Thread Vikram Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868403#comment-16868403
 ] 

Vikram Agrawal commented on SPARK-13809:


[~skonto] I have raised SPARK-28120 for a RocksDB implementation and opened a 
[PR|https://github.com/apache/spark/pull/24922] for it. 

> State Store: A new framework for state management for computing Streaming 
> Aggregates
> 
>
> Key: SPARK-13809
> URL: https://issues.apache.org/jira/browse/SPARK-13809
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>







[jira] [Created] (SPARK-28120) RocksDB state storage

2019-06-20 Thread Vikram Agrawal (JIRA)
Vikram Agrawal created SPARK-28120:
--

 Summary: RocksDB state storage
 Key: SPARK-28120
 URL: https://issues.apache.org/jira/browse/SPARK-28120
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Vikram Agrawal


SPARK-13809 introduced a framework for state management for computing streaming 
aggregates. The default implementation is an in-memory hashmap that is backed 
up to an HDFS-compliant file system at the end of every micro-batch. 

The current implementation suffers from performance and latency issues. It uses 
executor JVM memory to store the state, so the state store size is limited by 
the size of the executor memory. Executor JVM memory is also shared between 
state storage and other task operations, so the state storage size affects the 
performance of task execution.

Moreover, GC pauses, executor failures, and OOM issues become common as the 
state storage grows, which increases the overall latency of a micro-batch.

RocksDB is an embedded database that can provide major performance improvements. 
Other major streaming frameworks use RocksDB as their default state storage.
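
The default design described above can be illustrated with a toy model. The 
sketch below is not Spark's code: the class ToyHashMapStateStore, the helper 
run_batches, and the JSON snapshot layout are all hypothetical, and it writes a 
full snapshot after every batch for simplicity. It only shows the shape of the 
approach: all state lives in a heap-resident map, and that map is persisted 
once per micro-batch so it can be restored after a failure.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List


class ToyHashMapStateStore:
    """Toy model: state lives in a dict and is snapshotted once per micro-batch."""

    def __init__(self, checkpoint_dir: str) -> None:
        self.checkpoint_dir = checkpoint_dir
        self.state: Dict[str, int] = {}  # the whole state sits in process memory
        self.version = 0

    def update(self, key: str, delta: int) -> None:
        """Apply one aggregate update (a running count per key)."""
        self.state[key] = self.state.get(key, 0) + delta

    def commit(self) -> str:
        """End of a micro-batch: write the full state to the checkpoint directory."""
        self.version += 1
        path = os.path.join(self.checkpoint_dir, f"{self.version}.snapshot.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.state, f)
        return path

    @classmethod
    def restore(cls, checkpoint_dir: str, version: int) -> "ToyHashMapStateStore":
        """Rebuild the in-memory map from the snapshot of a committed version."""
        store = cls(checkpoint_dir)
        path = os.path.join(checkpoint_dir, f"{version}.snapshot.json")
        with open(path, encoding="utf-8") as f:
            store.state = json.load(f)
        store.version = version
        return store


def run_batches(checkpoint_dir: str,
                batches: Iterable[List[str]]) -> ToyHashMapStateStore:
    """Count words across micro-batches, committing after each batch."""
    store = ToyHashMapStateStore(checkpoint_dir)
    for batch in batches:
        for word in batch:
            store.update(word, 1)
        store.commit()
    return store


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = run_batches(d, [["a", "b", "a"], ["b", "c"]])
        recovered = ToyHashMapStateStore.restore(d, store.version)
        print(recovered.state)
```

Because the map grows with the number of distinct keys, its footprint competes 
with task memory in the same process, which is the pressure the issue 
describes. An embedded store such as RocksDB keeps most of that data outside 
the JVM heap.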








[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-12-06 Thread Vikram Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711319#comment-16711319
 ] 

Vikram Agrawal commented on SPARK-18165:


Hi [~danielil] - right now it is available at 
https://github.com/qubole/kinesis-sql

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming






[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-11-22 Thread Vikram Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696303#comment-16696303
 ] 

Vikram Agrawal commented on SPARK-18165:


[~piyush9194] - The library is already available for both Spark 2.3 and 2.4. 







[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-30 Thread Vikram Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494732#comment-16494732
 ] 

Vikram Agrawal commented on SPARK-18165:


[~mail2sivan...@gmail.com] - This library has been developed and tested against 
Spark 2.2.x. I understand that you are trying it against Spark 2.3.0. 

Could you please raise an issue in the kinesis-sql repo 
(https://github.com/qubole/kinesis-sql) so that we can continue the discussion 
there?







[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-07 Thread Vikram Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466897#comment-16466897
 ] 

Vikram Agrawal commented on SPARK-18165:


Thanks [~marmbrus]

- I am planning to start porting the connector in the next few weeks. I will 
share my feedback and ask for help once I am ready. 
- Thanks for your suggestion. I will check out Apache Bahir and Spark Packages 
and start a PR once I have ported my changes to the DataSourceV2 APIs.







[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-03-08 Thread Vikram Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391557#comment-16391557
 ] 

Vikram Agrawal commented on SPARK-18165:


[~gaurav24] - yeah I saw that. Nonetheless, I have spent enough time going 
through the available Kinesis APIs and the Structured Streaming Source Provider 
requirements to come up with this library. You can give it a try and share your 
feedback/suggestions.

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming






[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-03-08 Thread Vikram Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391305#comment-16391305
 ] 

Vikram Agrawal commented on SPARK-18165:


I have worked on an implementation of Kinesis Integration as a source for 
Structured Streaming. It's available here: 
https://github.com/qubole/kinesis-sql.  

Please try it out. Would be happy to discuss the design details and work on any 
concerns. If the implementation is acceptable and there is enough interest, I 
will start a PR for it.




