[GitHub] [spark] HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS

2019-12-10 Thread GitBox
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental 
parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-564422644
 
 
   @shahidki31 
   No, I didn't intend to persuade you to close this. I just wanted to make 
sure we have a clear picture of the full implementation before dealing with each 
part, but it's OK with me if you'd like to proceed with the current solution, as 
I think I can extend it with snapshotting.
   
   I could take a look at the current solution, but you still need to persuade 
at least one committer to push this forward.
   
   Btw, we should describe the performance test in detail. It should 
include at least:
   
   * size of event log file for initial load
   * elapsed time for initial load
   * count/size of the additional events (mostly the size)
   * elapsed time for loading additional events
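
   The four measurements above could be collected with a small harness like 
the one below. This is only a sketch: `timed_load` is a hypothetical helper, 
and it counts non-empty lines as a stand-in for actually replaying events, 
so the timings it reports are not comparable to real SHS numbers.

```python
import time


def timed_load(path, start_offset=0):
    """Read events from start_offset to EOF and report size, count and time.

    Assumption: one event per line; counting lines stands in for replaying.
    """
    started = time.perf_counter()
    event_count = 0
    with open(path, "rb") as f:
        f.seek(start_offset)
        for line in f:
            if line.strip():
                event_count += 1
        end_offset = f.tell()
    return {
        "bytes_read": end_offset - start_offset,
        "event_count": event_count,
        "elapsed_seconds": time.perf_counter() - started,
        "end_offset": end_offset,
    }
```

   Calling `timed_load(path)` once gives the size and elapsed time of the 
initial load; calling it again with the returned `end_offset` after the log 
has grown gives the size, count, and elapsed time for the additional events.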
   
   To me, the statement in your PR description reads as: skipping 2G (via read 
and drop) takes around 2 secs, which is still not ideal (as we know how to do 
it better), though I agree that's still a huge improvement.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS

2019-12-10 Thread GitBox
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental 
parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-564357783
 
 
   > there are 2 ways to skip the parsed content: filtering by lines (the 
approach in this PR) and skipping by bytes (see mergeApplicationListing).
   
   Yeah, if possible we should go with the latter approach. I guess that brings 
more changes, as ReplayListener just relies on the Scala API which provides lines 
(no offset information), so we would have to get our hands dirty, but it's 
definitely worth doing.
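
   To illustrate the difference between the two approaches, here is a minimal 
sketch in Python (not Spark code). The function names and the one-JSON-event- 
per-line file layout are assumptions made for the example; the point is only 
that line filtering re-reads everything before the resume point, while a byte 
offset lets the reader seek past it.

```python
import os
import tempfile


def replay_skipping_lines(path, lines_already_parsed):
    """Filter by lines: re-read from the start and drop already-parsed lines."""
    new_events = []
    with open(path, "rb") as f:
        for index, line in enumerate(f):
            if index >= lines_already_parsed:
                new_events.append(line.rstrip(b"\n"))
    return new_events


def replay_skipping_bytes(path, byte_offset):
    """Skip by bytes: seek to the saved offset, then read to the end."""
    new_events = []
    with open(path, "rb") as f:
        f.seek(byte_offset)
        for line in f:
            new_events.append(line.rstrip(b"\n"))
        next_offset = f.tell()  # offset to resume from next time
    return new_events, next_offset


def demo():
    # Hypothetical two-event log that later grows by one event.
    with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as f:
        f.write(b'{"Event":"A"}\n{"Event":"B"}\n')
        path = f.name
    try:
        _, offset = replay_skipping_bytes(path, 0)  # initial load
        with open(path, "ab") as f:
            f.write(b'{"Event":"C"}\n')
        by_lines = replay_skipping_lines(path, 2)
        by_bytes, new_offset = replay_skipping_bytes(path, offset)
        return by_lines, by_bytes, offset, new_offset
    finally:
        os.remove(path)
```

   Both functions return the same new events, but the byte-based one needs the 
reader to expose its position, which a lines-only API does not provide.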





[GitHub] [spark] HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS

2019-12-09 Thread GitBox
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental 
parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-563780160
 
 
   > Instead, in this PR, I am not closing the store, whenever there are 
changes in event logs or invalidate cache.
   
   This patch is simpler because it doesn't take "restarting SHS" into 
account. Restarting SHS will lose the information. And for now we may not want 
to track the line offset in SHS's KV store (`listing`), since the line offset is 
only valid during a single run of SHS.
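
   A toy illustration of that limitation, in Python rather than Spark code: 
`InMemoryOffsetTracker` is a hypothetical class that keeps the line offset 
only in process memory, so a fresh instance (standing in for a restarted SHS) 
has to replay the whole log again.

```python
class InMemoryOffsetTracker:
    """Remembers, per log, how many lines were already replayed (memory only)."""

    def __init__(self):
        self._lines_parsed = {}  # log name -> number of lines already replayed

    def replay(self, log_name, lines):
        """Return only the lines not seen by this instance, then advance."""
        start = self._lines_parsed.get(log_name, 0)
        new_lines = lines[start:]
        self._lines_parsed[log_name] = len(lines)
        return new_lines


def demo():
    tracker = InMemoryOffsetTracker()
    first = tracker.replay("app-1", ["e1", "e2"])
    second = tracker.replay("app-1", ["e1", "e2", "e3"])  # incremental
    restarted = InMemoryOffsetTracker()  # simulated restart: state is gone
    after_restart = restarted.replay("app-1", ["e1", "e2", "e3"])
    return first, second, after_restart
```

   Within one instance the second call returns only the new event; after the 
simulated restart all three events are replayed again.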
   
   When you consider restoring the KV store and the state of listeners when 
restarting SHS, you will have to store a snapshot of the KV store somewhere 
(that's why SPARK-29111 came in), and then you have to worry about the 
compatibility of the snapshot (entities in the KV store, including live entities 
on listeners) across Spark versions. That's why I had to change the design and 
introduce SPARK-29779 instead of snapshotting.
   
   We've already gone through a bunch of discussions because it is not as 
simple in reality as it seems; so please go through these PRs as well as the 
design docs.
   
   I guess the patch can be reviewed right now if the community prefers to 
first have a solution which works within a single SHS run (though this may 
conflict with compaction #26416, which needs some arrangement), but if the 
community wants to have a solution which covers more cases, SPARK-28870 seems to 
be the way to go. (That doesn't mean this patch will not be valid - this patch 
could cover SPARK-29261 with some modification.)





[GitHub] [spark] HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental parsing of event logs in SHS

2019-12-09 Thread GitBox
HeartSaVioR commented on issue #26821: [SPARK-20656][CORE]Support Incremental 
parsing of event logs in SHS
URL: https://github.com/apache/spark/pull/26821#issuecomment-563502972
 
 
   Thanks for cc'ing me, @dongjoon-hyun. I'll take a look.
   
   Btw, I think we have another JIRA issue for supporting incremental parsing, 
[SPARK-28870](https://issues.apache.org/jira/browse/SPARK-28870), which has 
a broader goal - to work with any implementation of KVStore.
   
   At first glance, this patch could cover SPARK-29261 and with SPARK-29111 it 
may resolve SPARK-28870 altogether - though we struggled with the details 
previously, so I need some time to go through it in depth.
   
   @shahidki31 
   I guess you've been following the previous discussions/efforts that 
@Ngone51 and I, along with @vanzin and @squito, have made. (#25577 and #25943, 
and the relevant Google docs in the related JIRA issues/PRs) If not, it would be 
worth going through them, as we've discussed this in detail.

