[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 @aalobaidi I would just like to see the benefit of unloading the version of state that is expected to be read by the next batch. I totally agree that the current cache mechanism is excessive,

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-11 Thread aalobaidi
Github user aalobaidi commented on the issue: https://github.com/apache/spark/pull/21500 @HeartSaVioR 1. As I mentioned before, this option is beneficial for use cases with bigger micro-batches. This way the overhead of loading the state from disk will be spread across

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 After enabling the option, I observed a small, expected latency per partition at the start of each batch. The median/average was 4~50 ms in my case, but the max latency was a bit higher

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-09 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 @aalobaidi When a batch starts, the latest version of state is read in order to start a new version of state. If the state has to be restored from a snapshot as well as delta files, it will incur huge
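To make the restore cost concrete, here is a minimal, self-contained sketch of the snapshot-plus-delta replay the comment above describes. Everything here (the `snapshots`/`deltas` structures and `restore_version`) is illustrative, not Spark's actual API; HDFSBackedStateStoreProvider reads the equivalent data from files in the checkpoint directory.

```python
from bisect import bisect_right

# Hypothetical in-memory stand-in for the files the provider keeps per
# partition: full snapshots every N versions, delta files in between.
snapshots = {10: {"a": 1, "b": 2}}   # version -> full state map
deltas = {                           # version -> keys written (None = removed)
    11: {"b": 3},
    12: {"a": None, "c": 7},
}

def restore_version(v):
    """Load the nearest snapshot <= v, then replay every delta up to v.

    This replay is the cost being debated in the thread: it is paid at the
    start of a batch whenever the needed version is not cached in memory.
    """
    snap_versions = sorted(snapshots)
    snap_ver = snap_versions[bisect_right(snap_versions, v) - 1]
    state = dict(snapshots[snap_ver])
    for ver in range(snap_ver + 1, v + 1):
        for key, value in deltas.get(ver, {}).items():
            if value is None:
                state.pop(key, None)   # tombstone: key was removed
            else:
                state[key] = value
    return state
```

The farther the requested version is from the last snapshot, the more delta files must be replayed, which is the latency spike under discussion.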

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-08 Thread aalobaidi
Github user aalobaidi commented on the issue: https://github.com/apache/spark/pull/21500 I can confirm that snapshots are still being built normally with no issue. @HeartSaVioR I'm not sure why the executor must load at least 1 version of state in memory. Could you elaborate?

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21500 Can one of the admins verify this patch?

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-07 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 @aalobaidi One thing you may want to be aware of is that, from the executor's point of view, it must load at least 1 version of state in memory regardless of how many versions are cached. I guess you may

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-07 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 @aalobaidi You can also merge #21506 (perhaps changing the log level, or modifying the patch to log the messages at INFO level) and see the latencies of loading state, snapshotting, and cleaning up.

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-07 Thread aalobaidi
Github user aalobaidi commented on the issue: https://github.com/apache/spark/pull/21500 Sorry for the late reply. The option is useful for a specific use case: micro-batches with a relatively large number of partitions, each of which is very big in size. When this

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-07 Thread arunmahadevan
Github user arunmahadevan commented on the issue: https://github.com/apache/spark/pull/21500 Clearing the map after each commit might make things worse, since the maps need to be loaded from the snapshot + delta files for the next micro-batch. Setting

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-07 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 Retaining versions of state is also relevant to snapshotting the last version to files: HDFSBackedStateStoreProvider doesn't snapshot a version if it doesn't exist in loadedMaps. So we may
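The interaction between cache eviction and snapshotting that this comment raises can be sketched with a toy version of the loadedMaps cache. The class name and the `max_versions_in_memory` knob below are illustrative, not Spark's actual names:

```python
from collections import OrderedDict

class VersionCache:
    """Toy model of loadedMaps: state maps keyed by version, oldest first,
    with a bound on how many versions stay in memory."""

    def __init__(self, max_versions_in_memory):
        self.max_versions = max_versions_in_memory
        self.loaded_maps = OrderedDict()   # version -> state map

    def put(self, version, state):
        self.loaded_maps[version] = state
        while len(self.loaded_maps) > self.max_versions:
            self.loaded_maps.popitem(last=False)   # evict the oldest version

    def get(self, version):
        return self.loaded_maps.get(version)

    def can_snapshot(self, version):
        # Mirrors the constraint above: the maintenance task can only
        # snapshot a version whose map is still in loadedMaps; evict too
        # aggressively and those snapshots are silently skipped.
        return version in self.loaded_maps
```

This is why simply clearing the cache after each commit interacts badly with the background snapshot task: the version it wants to snapshot may already be gone.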

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-06 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 @TomaszGaweda @aalobaidi Please correct me if I'm missing something here. At the start of every batch, the state store loads the previous version of state so that it can be read and written. If we

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-06 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21500 @HeartSaVioR IMHO we should consider a new state provider such as RocksDB, like Flink and Databricks Delta did. It is not a direct fix, but it will improve latency and memory consumption, maybe
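A note for readers revisiting this thread on a newer Spark: a RocksDB-based state store provider did eventually land in open-source Spark 3.2.0. Assuming Spark 3.2+, it can be enabled with a single setting (spark-defaults style shown; tuning it is out of scope here):

```
spark.sql.streaming.stateStore.providerClass org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```

With this provider the working state lives off-heap in RocksDB rather than in JVM hash maps, which sidesteps the per-version in-memory caching discussed in this thread.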

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-05 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 I agree that the current cache approach may consume excessive memory unnecessarily, which matches my finding in #21469 as well. The issue is not that simple, however, because in
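As a postscript: follow-up work later split in-memory retention from on-disk retention (SPARK-24717, shipped in Spark 2.4, if I recall the ticket correctly), adding a dedicated knob for how many batch versions HDFSBackedStateStoreProvider keeps cached. Assuming Spark 2.4+, in spark-defaults style:

```
spark.sql.streaming.maxBatchesToRetainInMemory 2
```

Lowering this value reduces executor memory pressure at the cost of re-reading state from snapshot and delta files, which is exactly the trade-off debated throughout this thread.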

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21500 Can one of the admins verify this patch?

[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-05 Thread aalobaidi
Github user aalobaidi commented on the issue: https://github.com/apache/spark/pull/21500 @tdas this is the change I mentioned in our chat at Spark Summit.
