Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
@aalobaidi
I would just like to see the benefit of unloading the version of state
that is expected to be read by the next batch. I totally agree that the current
cache mechanism is excessive,
Github user aalobaidi commented on the issue:
https://github.com/apache/spark/pull/21500
@HeartSaVioR
1. As I mentioned before, this option is beneficial for use cases with
larger micro-batches. This way the overhead of loading the state from disk will
be spread across
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
After enabling the option, I've observed a small expected latency at the
start of each batch, per partition. The median/average was 4~50 ms
in my case, but the max latency was a bit higher
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
@aalobaidi
When starting a batch, the latest version of state is read in order to start
a new version of state. If the state has to be restored from a snapshot as well
as delta files, it will incur huge
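To make the restore path concrete: conceptually, a version is rebuilt by taking the newest snapshot at or below it and replaying the delta files recorded after it, oldest first. A minimal sketch, assuming simple key-value maps (names are illustrative, not Spark's internal API):

```scala
// Illustrative only: rebuilding a state version from a snapshot plus
// the delta files recorded after that snapshot (oldest delta first).
object StateRestoreSketch {
  type StateMap = Map[String, Int]

  // Each delta is modeled as a map of upserts; later deltas win.
  def restore(snapshot: StateMap, deltas: Seq[StateMap]): StateMap =
    deltas.foldLeft(snapshot)((acc, delta) => acc ++ delta)
}
```

The more delta files sit between snapshots, the longer this replay takes, which is the cost being discussed above.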
Github user aalobaidi commented on the issue:
https://github.com/apache/spark/pull/21500
I can confirm that snapshots are still being built normally with no issue.
@HeartSaVioR, I'm not sure why the executor must load at least one version of
state in memory. Could you elaborate?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21500
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
@aalobaidi
One thing you may want to be aware of is that, from the executor's point of
view, it must load at least one version of state in memory regardless of how
many versions are cached. I guess you may
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
@aalobaidi
You can also merge #21506 (perhaps changing the log level, or modifying the
patch to log the messages at INFO level) and see the latencies of loading
state, snapshotting, and cleaning up.
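If merging that patch is not convenient, a tiny timing wrapper can approximate the same measurements; this helper is our own sketch, not part of Spark or of #21506:

```scala
// Hypothetical helper for measuring latencies of state-store operations;
// wraps any expression and prints its elapsed wall-clock time.
object TimingSketch {
  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label%s took $elapsedMs%.1f ms")
    result
  }
}
```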
---
Github user aalobaidi commented on the issue:
https://github.com/apache/spark/pull/21500
Sorry for the late reply. The option is useful for a specific use case:
micro-batches with a relatively large number of partitions, where each
partition is very big in size. When this
Github user arunmahadevan commented on the issue:
https://github.com/apache/spark/pull/21500
Clearing the map after each commit might make things worse, since the maps
need to be loaded from the snapshot + delta files for the next micro-batch.
Setting
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
Retaining versions of state is also relevant to snapshotting the last
version to files: HDFSBackedStateStoreProvider doesn't snapshot a version if it
doesn't exist in loadedMaps. So we may
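The guard being described can be sketched as follows (illustrative names, not the provider's real signature):

```scala
// Sketch of the behavior described above: snapshotting consults the
// in-memory cache, so a version evicted from loadedMaps is never snapshotted.
object SnapshotGuardSketch {
  type StateMap = Map[String, Int]

  def maybeSnapshot(loadedMaps: Map[Long, StateMap], version: Long): Option[StateMap] =
    loadedMaps.get(version) // None => no snapshot file is written
}
```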
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
@TomaszGaweda @aalobaidi
Please correct me if I'm missing something here.
At the start of every batch, the state store loads the previous version of
state so that it can be read and written. If we
Github user TomaszGaweda commented on the issue:
https://github.com/apache/spark/pull/21500
@HeartSaVioR IMHO we should consider a new state provider such as RocksDB,
as Flink and Databricks Delta did. It is not a direct fix, but it will improve
latency and memory consumption, maybe
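For reference, the state store backend is pluggable through configuration, so an alternative provider could be wired in without changing the default one; the class name below is hypothetical:

```
spark.sql.streaming.stateStore.providerClass=org.example.RocksDBStateStoreProvider
```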
Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/21500
I agree that the current cache approach may consume excessive memory
unnecessarily, which also matches my finding in #21469.
The issue is not that simple, however, because in
---
Github user aalobaidi commented on the issue:
https://github.com/apache/spark/pull/21500
@tdas this is the change I mentioned in our chat at Spark Summit.
---