[GitHub] [spark] HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-564422644

@shahidki31 No, I didn't intend to persuade you to close this. I just wanted to make sure we have a clear picture of the full implementation before dealing with each part, but it's OK with me if you'd like to go ahead with the current solution, as I think I can extend it with snapshotting. I can take a look at the current solution, but you still need to persuade at least one committer to push this forward.

Btw, we'd better clarify the performance test in detail. It should include at least:

* size of the event log file for the initial load
* elapsed time for the initial load
* count/size of the events added (mostly the size)
* elapsed time for loading the additional events

To me, the statement in the PR description reads as: skipping 2G (via read and drop) takes around 2 secs. That is still not ideal (as we know how to do it better), though I agree it's still a huge improvement.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards, Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
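As an illustration of the four measurements requested in the comment above, here is a minimal Python sketch. It is not Spark code: `timed_load`, `benchmark`, and the `append_more` hook are hypothetical names, and counting newline-terminated lines merely stands in for replaying events in SHS.

```python
import os
import time


def timed_load(path, start_offset=0):
    """Read complete lines from start_offset.

    Returns (event_count, bytes_read, elapsed_seconds). Counting lines
    is only a stand-in for real event replay (assumption).
    """
    started = time.perf_counter()
    count = 0
    bytes_read = 0
    with open(path, "rb") as f:
        f.seek(start_offset)
        for line in f:
            if not line.endswith(b"\n"):
                break  # trailing line is still being written; skip it
            count += 1
            bytes_read += len(line)
    return count, bytes_read, time.perf_counter() - started


def benchmark(path, append_more):
    """Collect the four numbers for one initial load and one incremental load.

    append_more is a hypothetical hook that appends further events to the
    log, simulating an application that is still running.
    """
    initial_size = os.path.getsize(path)
    _, initial_bytes, initial_secs = timed_load(path)
    append_more(path)
    added_count, added_bytes, added_secs = timed_load(path, initial_bytes)
    return {
        "initial_log_bytes": initial_size,
        "initial_load_secs": initial_secs,
        "added_events": added_count,
        "added_bytes": added_bytes,
        "added_load_secs": added_secs,
    }
```

Reporting all four values together makes it possible to tell whether the incremental load time scales with the size of the added events or with the size of the whole file.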
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-564357783

> there are 2 ways to skip the parsed content: filtering by lines (the approach in this PR) and skipping by bytes (see mergeApplicationListing).

Yeah, if possible we should go with the latter approach. I guess that brings more changes, as ReplayListener just relies on the Scala API, which provides lines (no offset information), so we'd have to get our hands dirty, but it's definitely worth doing.
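To make the difference between the two approaches concrete, here is a minimal Python sketch. It is not the Spark implementation: both function names are hypothetical, and a plain local file with one event per line stands in for an event log.

```python
def new_events_by_line_filter(path, lines_already_parsed):
    """Re-read the whole file and drop the lines that were parsed earlier.

    Every byte before the new content is still read and split into lines.
    """
    events = []
    with open(path, "rb") as f:
        for index, line in enumerate(f):
            if index < lines_already_parsed:
                continue  # read and discarded
            if not line.endswith(b"\n"):
                break  # incomplete trailing line; leave it for the next pass
            events.append(line[:-1].decode("utf-8"))
    return events, lines_already_parsed + len(events)


def new_events_by_byte_offset(path, offset):
    """Seek past the bytes that were parsed earlier and read only the tail."""
    events = []
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            if not line.endswith(b"\n"):
                break  # incomplete trailing line; leave it for the next pass
            events.append(line[:-1].decode("utf-8"))
            offset += len(line)  # track the offset ourselves, byte by byte
    return events, offset
```

Both functions return the same new events, but the second needs the caller to keep a byte offset rather than a line count. That is the extra bookkeeping the comment refers to: a line-oriented reading API does not expose offsets.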
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-563780160

> Instead, in this PR, I am not closing the store, whenever there are changes in event logs or invalidate cache.

This patch is simpler because it doesn't take "restarting SHS" into account. Restarting SHS will lose the information. And for now we may not want to track the line offset in SHS's KV store (`listing`), since the line offset is only effective during a single run of SHS.

When you consider restoring the KV store and the state of listeners across an SHS restart, you have to store a snapshot of the KV store somewhere (that's why SPARK-29111 came in), and then you have to worry about compatibility of the snapshot (entities in the KV store, including live entities on listeners) across Spark versions. That's why I had to change the design and introduce SPARK-29779 instead of snapshotting. We've already gone through a bunch of discussions because it's not as simple in reality as it seems, so please go through these PRs as well as the design docs.

I guess the patch can be reviewed right now if the community prefers to first have a solution which works within a single SHS run (though this may conflict with compaction #26416, which needs some arrangement), but if the community wants a solution which covers more cases, SPARK-28870 seems to be the way to go. (That doesn't mean this patch will not be valid - this patch could cover SPARK-29261 with some modification.)
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-563502972

Thanks for cc'ing me, @dongjoon-hyun. I'll take a look.

Btw, I think we have another JIRA issue for supporting incremental parsing, [SPARK-28870](https://issues.apache.org/jira/browse/SPARK-28870), which has a broader goal - running with any implementation of KVStore. At first glance, this patch could cover SPARK-29261, and with SPARK-29111 it may resolve SPARK-28870 altogether - though we struggled with the details previously, so I need some time to go through it deeply.

@shahidki31 I guess you've been following the previous discussions/efforts that @Ngone51 and I, along with @vanzin and @squito, have made (#25577 and #25943, and the relevant Google docs in the relevant JIRA issues/PRs). If not, it would be worth going through them, as we've discussed this in detail.