[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-11-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240801#comment-16240801
 ] 

Jason Lowe commented on YARN-7272:
--

bq. Another possible case to handle is when storage is down: instead of making 
the sync entity call wait, the entity could be committed to the WAL while the 
backend is unavailable. We can potentially explore this option.

My guess here is that this is going to be problematic because:

# By the time you get a robust, performant WAL implemented on HDFS you've 
practically reinvented the core of HBase.
# The point of having a synchronous call is to tell the client, "yes, I promise 
this has been persisted to the ATS database" yet it hasn't.

If the AM side-band signals another client to start reading from ATS then that 
other client will not see those writes despite the AM's synchronous call to the 
collector returning success.  The synchronous call cannot return until HBase 
says it has it.

In that sense, I don't see the WAL being so much a fault tolerance tool.  
Instead I see it as a performance enhancement tool where it can buffer more 
asynchronous events before blocking the caller or potentially recover more 
asynchronous events in the case of a collector crash.  The latter requires a 
lot of work, to the point where I can see us essentially requiring or 
reinventing systems like Apache BookKeeper.  I don't see how the WAL helps in 
the synchronous call scenario, since the whole point of the synchronous call is 
to guarantee the result appears in the ATSv2 database.

> Enable timeline collector fault tolerance
> -
>
> Key: YARN-7272
> URL: https://issues.apache.org/jira/browse/YARN-7272
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineclient, timelinereader, timelineserver
>Reporter: Vrushali C
>Assignee: Rohith Sharma K S
> Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down, and along with it the timeline collector aux service for a 
> running YARN app, we would like that YARN app to re-establish a connection 
> with a new timeline collector.






[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-11-06 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240754#comment-16240754
 ] 

Varun Saxena commented on YARN-7272:


Sorry for coming in a little late on this discussion, although we did discuss 
it during the call.
The primary objective of fault tolerance is to ensure that the entities which 
are guaranteed to be written by timeline service v2 are not lost. 
But writing every entity to some sort of WAL implementation would be expensive.

Now, we have 2 kinds of entity writes: sync and async.
Sync entities are guaranteed to be written to the backend via the collector, or 
an exception is returned (even for server-side failures), i.e. we indicate to 
the client that an entity could not be written all the way to the backend so 
that it can retry or take some other suitable action.
Async entities, as the name suggests, are written asynchronously. They are not 
guaranteed to be written to the backend, by design. We initially cache them at 
the client side for some time or until a sync entity arrives, combine them and 
then send them to the collector. Moreover, if any exception occurs while 
writing to the backend, the result is not propagated back to the client. We 
only throw exceptions for client-side failures.
Async entities are also cached in the HBase writer implementation, inside the 
collector, before being flushed to HBase.

Sync writes should therefore be used for publishing important events, while 
async writes should be used for less important events whose loss would not be a 
big deal in case of a failure. For instance, publishing metric values every N 
seconds can be an asynchronous write, unless the metric is very important, say, 
used for billing.
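
To make the distinction concrete, here is a minimal client-side sketch 
(assuming the standard TimelineV2Client API; the entity types and ids are 
purely illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineV2Client;

public class SyncVsAsyncSketch {
  public static void publish(Configuration conf, ApplicationId appId)
      throws Exception {
    // In a real AM the collector address arrives via the AM-RM protocol
    // (setTimelineCollectorInfo) before any write can succeed.
    TimelineV2Client client = TimelineV2Client.createTimelineClient(appId);
    client.init(conf);
    client.start();
    try {
      TimelineEntity billingEvent = new TimelineEntity();
      billingEvent.setType("APP_BILLING_EVENT");        // hypothetical type
      billingEvent.setId("billing_" + appId);

      TimelineEntity metricSample = new TimelineEntity();
      metricSample.setType("CONTAINER_METRIC_SAMPLE");  // hypothetical type
      metricSample.setId("sample_" + System.currentTimeMillis());

      // Sync: returns only once the collector confirms the write reached the
      // backend; otherwise it throws so the caller can retry.
      client.putEntities(billingEvent);

      // Async: buffered and flushed later; backend failures are not
      // propagated back to this caller.
      client.putEntitiesAsync(metricSample);
    } finally {
      client.stop();
    }
  }
}
{code}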

Keeping this in mind, a client can potentially do synchronous writes if it 
cares about the durability of entity data.
Furthermore, asynchronous writes can have other points of failure too. For 
instance, the collector can crash while writing an async entity to the WAL. In 
this case, we currently do not propagate the error to the timeline client, i.e. 
the client would not know which entity writes have failed.

Another possible case to handle is when storage is down: instead of making the 
sync entity call wait, the entity could be committed to the WAL while the 
backend is unavailable. We can potentially explore this option, for example in 
cases where the HBase cluster runs separately from the cluster where ATS is 
running.
For HBase, would HBaseAdmin#checkHBaseAvailable be sufficient to check whether 
HBase storage is down?
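
For reference, such a check could look roughly like the sketch below (hedged: 
checkHBaseAvailable only proves the master/ZooKeeper are reachable, so it may 
not be sufficient for the specific timeline tables the collector writes to):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class HBaseAvailabilityCheck {
  /**
   * Best-effort probe of the HBase backend. A true result only means the
   * master/ZooKeeper are reachable; regions hosting the timeline tables
   * could still be unavailable.
   */
  public static boolean isHBaseUp(Configuration yarnConf) {
    try {
      HBaseAdmin.checkHBaseAvailable(HBaseConfiguration.create(yarnConf));
      return true;
    } catch (Exception e) {
      // Master not running, ZooKeeper unreachable, RPC failure, etc.
      return false;
    }
  }
}
{code}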

Thoughts?







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-11-06 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240021#comment-16240021
 ] 

Rohith Sharma K S commented on YARN-7272:
-

Thanks [~vrushalic] for putting up the summary.
Adding to the above points, some of the pros and cons discussed in the call are:
Pros:
# An additional WAL layer would help recover async entities. This ensures that 
no entities sent by TimelineV2Clients to collectors are lost. This JIRA 
primarily tries to address two major downtime scenarios: the collector JVM 
going down, and the collector machine itself going down.
# The WAL layer is an independent service that runs on the collector. It is not 
tightly bound to the backend storage, so async entities can be recovered 
regardless of which backend storage is plugged in.

Cons:
# Ensuring all async entities are written into the WAL would be a costly 
operation, because multiple client requests would be waiting to write into 
HDFS. This creates contention on the WAL write path in order to ensure 
atomicity, which slows down request processing from TimelineClients.
# Storing entities into the WAL in addition to the backend storage is 
duplicated effort!
# Since we keep only the last 1 minute of data, every collector flush also 
requires renaming the file in HDFS. This leads to entity files spread across 
the cluster and slower writes, since a local write is always faster than a 
remote write! We probably need to think about how to use a single file for the 
whole collector lifetime while keeping track of only the last 1 minute of 
entities. HDFS has a *truncate* API; we need to check what that API actually 
provides (see the sketch after this list).
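
FileSystem#truncate (Hadoop 2.7+, HDFS-3107) can cut a file back to a given 
length without renaming or recreating it; a hedged sketch is below, where the 
WAL path and offset bookkeeping are assumptions rather than anything in the 
WIP patch:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalTruncateSketch {
  /**
   * Cut the WAL file back to 'keepBytes'. Returns true if the truncate took
   * effect immediately; false means the last block is being trimmed via
   * block recovery and the file cannot be re-opened for append until that
   * recovery completes.
   */
  public static boolean truncateWal(Configuration conf, Path walFile,
      long keepBytes) throws IOException {
    FileSystem fs = walFile.getFileSystem(conf);
    return fs.truncate(walFile, keepBytes);
  }
}
{code}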

I think _if the cost of flushing into the WAL for every async API call is 
greater than or equal to the cost of flushing into HBase (as of now), then it 
is better to flush into HBase directly_. But that approach is tightly coupled 
to the cost of the backend storage!







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-11-03 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238427#comment-16238427
 ] 

Vrushali C commented on YARN-7272:
--

This is the JIRA for fault tolerance of the timeline collector.

cc [~jlowe] [~jrottinghuis] [~djp], as discussed in the call.







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-11-02 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236796#comment-16236796
 ] 

Vrushali C commented on YARN-7272:
--

Sharing some thoughts:
Collector fault tolerance helps deal with two things:
- when the collector itself goes down
- when data sitting in the memory of the buffered mutator that has NOT yet been 
flushed to HBase is lost.

The fault tolerance solution should be able to be turned on/off, and should be 
off by default.

It should be a cluster-wide default as well as a client-specific setting. For 
example, a super-critical application might require zero tolerance for timeline 
data loss, in which case it can be turned on for that specific app. For some 
other app, slightly different tuning may be preferable. And for all other apps, 
writing to offline storage should be able to be turned off.
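
For illustration, the cluster-wide switch could look something like the snippet 
below; the property names are hypothetical placeholders (not from the WIP 
patch), and the per-app override would ride on top as an application-level 
setting:

{code:xml}
<!-- Hypothetical property names, for illustration only -->
<property>
  <name>yarn.timeline-service.collector.fault-tolerance.enabled</name>
  <value>false</value>  <!-- off by default -->
</property>
<property>
  <name>yarn.timeline-service.collector.fault-tolerance.wal.dir</name>
  <value>/tmp/yarn/atsv2-wal</value>
</property>
{code}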









[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220935#comment-16220935
 ] 

Hadoop QA commented on YARN-7272:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 18m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 57s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  6m 
55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m  4s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 34 new + 211 unchanged - 1 fixed = 245 total (was 212) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
49s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
37s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  
2s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch 
passed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
30s{color} | {color:red} The patch generated 4 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 97m 57s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 |
|  |  Inconsistent synchronization of 
org.apache.hadoop.yarn.server.timelineservice.recovery.FileSystemWALstore.deleteLogPathRoot;
 locked 50% of time  Unsynchronized access at FileSystemWALstore.java:50% of 
time  Unsynchronized access at FileSystemWALstore.java:[line 345] |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:5b98639 |
| JIRA Issue | YARN-7272 |
| JIRA 

[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203129#comment-16203129
 ] 

Rohith Sharma K S commented on YARN-7272:
-

Update: I had an offline discussion with Vinod, and his concern is that the 
scope of this JIRA is limited to the auxiliary service that runs on the 
NodeManager. Given that app collectors can be launched as separate containers 
(a long-term goal, not supported yet), the fault tolerance design should 
consider those use cases as well; otherwise we will end up redesigning the 
fault tolerance solution later.
Thinking in terms of recovery for container-based app collectors, which also 
holds good for auxiliary service recovery, storing the WAL in HDFS seems more 
appropriate.







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203096#comment-16203096
 ] 

Vinod Kumar Vavilapalli commented on YARN-7272:
---

bq. In the 1st case, there will be outstanding unflushed entities in the app 
collector buffer. If the NM is restarted, all the outstanding entities in the 
app collector buffer are lost. So the scope of fault tolerance is restricted to 
NM JVM restarts only.
bq. In the 2nd case, since the NM machine itself is down, all the running 
master containers are lost. The RM will launch these master containers on a 
different machine as a second attempt.
This assumes that the collector lives inside the NM. One of the design goals 
for large-scale apps is to fork the collector into its own container. When that 
is implemented, the above assumptions will be invalidated. We will have new 
fault scenarios where the collector and AM may run on different machines, only 
the collector dies and restarts on a different machine, etc.

bq. Since it is a fresh attempt, the old attempt's data is not very important 
to the end user. Considering this behavior, the 2nd case can be eliminated from 
consideration for fault tolerance of app collectors.
If our goal is to take care of entity/event data in transit for 1 minute 
(assuming the collector flush interval is 1 minute), we should be equally 
concerned about data loss due to NM failure, machine failure, or HBase 
failures.

Granted, an HBase client buffer solution is faster/cheaper than a leveldb 
solution, which is in turn faster/cheaper than writing a JobHistory-like WAL to 
HDFS. But the last one would encompass all those faults collectively, no?







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-09 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16198201#comment-16198201
 ] 

Rohith Sharma K S commented on YARN-7272:
-

Thanks for clarifying the doubts!
bq. Is there a specific concern about using leveldb to implement the WAL for 
transient persistence?
We don't have any concerns about using leveldb. Given that delete operations 
can be performed, I would also highly recommend using leveldb.







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196949#comment-16196949
 ] 

Jason Lowe commented on YARN-7272:
--

I'm not proposing we use leveldb for persisting the entities long-term, rather 
only for the duration between receipt from the client and up to the point the 
ATSv2 backend acknowledges receipt.  At that point the entries would be deleted 
from leveldb.  A routine, background compaction would prevent the database from 
growing to a point where recovery performance would be a concern.

The NM state store already does this today, deleting container, resource, and 
application entries when we no longer need to recover them.  Is there a 
specific concern about using leveldb to implement the WAL for transient 
persistence?  I just want to make sure we're not going to invent yet another 
WAL solution here as there are many to choose from already.
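
A minimal sketch of that receipt-to-ack lifecycle against the same leveldbjni 
API the NM state store uses (the key layout and serialization here are 
illustrative assumptions, not the NM state-store format):

{code:java}
import java.io.File;
import java.io.IOException;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

/** Entities live here only between receipt and the backend's ack. */
public class TransientEntityWal implements AutoCloseable {
  private final DB db;

  public TransientEntityWal(File dir) throws IOException {
    Options options = new Options();
    options.createIfMissing(true);
    db = JniDBFactory.factory.open(dir, options);
  }

  /** Record an entity before it is handed to the backend writer. */
  public void logPending(String entityKey, byte[] serializedEntity) {
    db.put(JniDBFactory.bytes(entityKey), serializedEntity);
  }

  /** Drop the entry once the ATSv2 backend has acknowledged the write. */
  public void markFlushed(String entityKey) {
    db.delete(JniDBFactory.bytes(entityKey));
  }

  @Override
  public void close() throws IOException {
    db.close();
  }
}
{code}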







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-08 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196504#comment-16196504
 ] 

Rohith Sharma K S commented on YARN-7272:
-

Thanks Jason for your inputs! I am looking not only at read/write operations 
but also at delete operations. The reason for deletes is that if we keep all 
the entities in leveldb, it would become too heavy, like in ATS1. So only the 
outstanding entities that have not yet been flushed to the backend need to be 
kept in leveldb. After a flush succeeds, they are deleted from the WAL, which 
reduces its size. The advantage is that recovery will be faster with only the 
delta entities.

bq. Leveldb is already a dependency used in multiple places, and I'd hate to 
see us add yet another dependency or reinvent the wheel here.
Sorry, I didn't get what consensus was reached. Shouldn't ATSv2 use leveldb for 
the WAL writer?







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194763#comment-16194763
 ] 

Jason Lowe commented on YARN-7272:
--

Leveldb seems like a great fit for this, IMO.  It has high performance for 
writes and works quite well in the nodemanager use-case.  This case seems 
identical in that the collector would write to the database and only read upon 
recovery.  Leveldb is already a dependency used in multiple places, and I'd 
hate to see us add yet another dependency or reinvent the wheel here.








[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-05 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194138#comment-16194138
 ] 

Rohith Sharma K S commented on YARN-7272:
-

This proposal was discussed in the ATS weekly call, and one of the concerns 
from [~varun_saxena] is the impact on performance if we use the FileSystem. 
This needs to be validated before and after the WAL implementation. As part of 
this discussion, we also had thoughts on using leveldb for storing buffered 
entities; this also needs to be validated. Probably we can provide an interface 
for the WAL writer so that any efficient library can be plugged in, either the 
local FS or leveldb!
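
A rough sketch of what such a pluggable WAL writer interface could look like 
(names and methods are hypothetical, just to illustrate the idea):

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;

/** Hypothetical WAL writer SPI; local-FS and leveldb backed implementations
 *  could both be plugged in behind it. */
public interface CollectorWalWriter {

  /** Append entities that have not yet been flushed to the backend. */
  void append(List<TimelineEntity> entities) throws IOException;

  /** Drop entries once the backend acknowledges them. */
  void markFlushed(List<TimelineEntity> entities) throws IOException;

  /** Replay outstanding entities when a collector recovers. */
  List<TimelineEntity> recover() throws IOException;

  void close() throws IOException;
}
{code}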







[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance

2017-10-05 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194127#comment-16194127
 ] 

Rohith Sharma K S commented on YARN-7272:
-

Thoughts on collector fault tolerance! Scenarios to consider for fault 
tolerance are:
* NodeManager JVM restart
** The NM is up and running but the HBase cluster is down.
** The TimelineClient async API puts entities into the app collector buffer, 
which is prone to losing data within the short span of the flush interval.
* The NM machine is lost, whether due to a network outage or split-brain issues.

In the 1st case, there will be outstanding unflushed entities in the app 
collector buffer. If the NM is restarted, all the outstanding entities in the 
app collector buffer are lost. So the scope of fault tolerance is restricted to 
NM JVM restarts only.

In the 2nd case, since the NM machine itself is down, all the running master 
containers are lost. The RM will launch these master containers on a different 
machine as a second attempt. Since it is a fresh attempt, the old attempt's 
data is not very important to the end user. Considering this behavior, the 2nd 
case can be eliminated from consideration for fault tolerance of app collectors.

The approach is to provide a WAL in the app collector. The WAL will contain 
only unflushed entity entries; any entities that have been flushed are removed 
from the WAL, and from that point we rely on the backend's fault tolerance 
functionality. This keeps the WAL very small, i.e. at most the last 1 minute of 
data (1 minute being the flush interval in the app collector). I plan to use 
the local FS to store the WALs.
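
A rough local-FS sketch of this keep-only-unflushed scheme (file layout, 
naming, and the checkpoint-rewrite approach are illustrative assumptions, not 
the WIP patch):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * After every flush cycle the WAL is rewritten with only the entities that
 * are still awaiting flush, so it never holds more than roughly one flush
 * interval (~1 minute) of data.
 */
public class RollingLocalWal {
  private final Path walDir;
  private Path current;

  public RollingLocalWal(String dir) throws IOException {
    this.walDir = Files.createDirectories(Paths.get(dir));
  }

  /** Replace the WAL with a snapshot of the still-unflushed entities. */
  public synchronized void checkpoint(byte[] unflushedEntities, long flushSeq)
      throws IOException {
    Path next = walDir.resolve("collector-wal." + flushSeq);
    try (OutputStream out = Files.newOutputStream(
        next, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
      out.write(unflushedEntities);
    }
    if (current != null) {
      // Everything in the previous snapshot is either already flushed to the
      // backend or re-recorded in the new snapshot, so it can go.
      Files.deleteIfExists(current);
    }
    current = next;
  }
}
{code}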





