[jira] [Created] (YARN-6318) timeline service schema creator fails if executed from a remote machine
Sangjin Lee created YARN-6318:
----------------------------------

             Summary: timeline service schema creator fails if executed from a remote machine
                 Key: YARN-6318
                 URL: https://issues.apache.org/jira/browse/YARN-6318
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: 3.0.0-alpha1
            Reporter: Sangjin Lee

The timeline service schema creator fails if executed from a remote machine and the remote machine does not have the right {{hbase-site.xml}} file to talk to that remote HBase cluster.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
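For reference, a minimal client-side {{hbase-site.xml}} of the kind this issue concerns typically only needs to name the remote HBase cluster's ZooKeeper quorum. The host names and port below are placeholders, not values taken from this report:

```xml
<configuration>
  <!-- Placeholder quorum: replace with the ZooKeeper hosts of the
       remote HBase cluster backing the timeline service. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- Default ZooKeeper client port; change if the cluster differs. -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```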
[jira] [Created] (YARN-6170) TimelineReaderServer should wait to join with HttpServer2
Sangjin Lee created YARN-6170:
----------------------------------

             Summary: TimelineReaderServer should wait to join with HttpServer2
                 Key: YARN-6170
                 URL: https://issues.apache.org/jira/browse/YARN-6170
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelinereader
    Affects Versions: YARN-5355
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee
            Priority: Minor

While I was backporting YARN-5355-branch-2 to a 2.6.0-based code branch, I noticed that the timeline reader daemon would promptly shut down upon start. It turns out that, in the 2.6.0 code line at least, only daemon threads are left once the main method returns, which causes the JVM to shut down.

The right pattern for starting an embedded jetty web server is to call {{Server.start()}} followed by {{Server.join()}}. That way, the server stays up reliably no matter what other threads get created. It works on YARN-5355 only because there *happens* to be one other non-daemon thread. We should add the {{join()}} call to be always correct.
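The daemon-thread behavior described here is easy to reproduce outside of Hadoop. This is a minimal sketch (not the actual TimelineReaderServer code) of why a main method that leaves only daemon threads behind needs an explicit join on a non-daemon thread, which is what the {{Server.start()}}/{{Server.join()}} pattern provides with jetty:

```java
// Minimal sketch, not the TimelineReaderServer code: the JVM exits once the
// last non-daemon thread finishes, so a process whose main() returns with
// only daemon threads left shuts down immediately.
public class DaemonJoinDemo {

    /** Starts a background thread marked as a daemon, like many server worker threads. */
    public static Thread startDaemonWorker() {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // pretend to serve requests
            } catch (InterruptedException ignored) {
                // exit quietly
            }
        });
        worker.setDaemon(true); // daemon threads do NOT keep the JVM alive
        worker.start();
        return worker;
    }

    public static void main(String[] args) {
        Thread worker = startDaemonWorker();
        System.out.println("worker daemon? " + worker.isDaemon());
        // Without a join() on some non-daemon thread here, main() returns and
        // the JVM shuts down even though the worker is still sleeping.
    }
}
```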
[jira] [Created] (YARN-6140) start time key in NM leveldb store should be removed when container is removed
Sangjin Lee created YARN-6140:
----------------------------------

             Summary: start time key in NM leveldb store should be removed when container is removed
                 Key: YARN-6140
                 URL: https://issues.apache.org/jira/browse/YARN-6140
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: nodemanager
    Affects Versions: YARN-5355
            Reporter: Sangjin Lee

It appears that the start time key is not removed when the container is removed. The key was introduced in YARN-5792.

I found this while backporting the YARN-5355-branch-2 branch to our internal branch loosely based on 2.6.0. The {{TestNMLeveldbStateStoreService}} test was failing because of this. I'm not sure why we didn't see this earlier.
[jira] [Created] (YARN-6095) create a REST API that returns the clusters for a given app id
Sangjin Lee created YARN-6095:
----------------------------------

             Summary: create a REST API that returns the clusters for a given app id
                 Key: YARN-6095
                 URL: https://issues.apache.org/jira/browse/YARN-6095
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
            Reporter: Sangjin Lee

It would be good to have a timeline service REST endpoint that can return the list of clusters for a given app id. This becomes possible after YARN-5378 is in.
[jira] [Created] (YARN-5792) adopt the id prefix for YARN, MR, and DS entities
Sangjin Lee created YARN-5792:
----------------------------------

             Summary: adopt the id prefix for YARN, MR, and DS entities
                 Key: YARN-5792
                 URL: https://issues.apache.org/jira/browse/YARN-5792
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-5355
            Reporter: Sangjin Lee

We introduced the entity id prefix to support flexible entity sorting (YARN-5715). We should adopt the id prefix for YARN entities, MR entities, and DS entities to take advantage of the id prefix.
[jira] [Created] (YARN-5715) introduce entity prefix for return and sort order
Sangjin Lee created YARN-5715:
----------------------------------

             Summary: introduce entity prefix for return and sort order
                 Key: YARN-5715
                 URL: https://issues.apache.org/jira/browse/YARN-5715
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
            Reporter: Sangjin Lee
            Priority: Critical

While looking into YARN-5585, we have come across the need to provide a sort order different from the current entity id order. The current entity id order returns entities strictly in lexicographical order, and as such it returns the earliest entities first. This may not be the most natural return order. A more natural return/sort order would be from the most recent entities.

To solve this, we would like to add what we call the "entity prefix" to the row key for the entity table. It is a number (long) that can easily be provided by the client on write. In the row key, it would be added before the entity id itself.

The entity prefix would be considered mandatory. On all writes (including updates) the correct entity prefix should be set by the client so that the correct row key is used. The entity prefix needs to be unique only within the scope of the application and the entity type.

For queries that return a list of entities, the prefix values will be returned along with the entity ids. Queries that specify both the prefix and the id should return quickly using the row key. If the query omits the prefix but specifies the id (query by id), the query may be less efficient.

This JIRA should add the entity prefix to the entity API and add its handling to the schema and the write path. The read path will be addressed in YARN-5585.
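As a sketch of the mechanism (names and key layout here are illustrative, not the actual YARN entity-table schema), a client-supplied long written ahead of the entity id makes HBase's ascending row-key scan follow the prefix rather than the lexicographic order of the ids. A client that wants most-recent-first order can, for instance, supply an inverted creation time as the prefix:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of an "entity prefix" row key, not the real schema code:
// a client-chosen long precedes the entity id, so scan order is driven by the
// prefix and the id only breaks ties.
public class EntityPrefixDemo {

    // One possible client policy: invert the creation time so that newer
    // entities sort first in an ascending scan.
    public static long invertedPrefix(long createdTimeMillis) {
        return Long.MAX_VALUE - createdTimeMillis;
    }

    public static byte[] rowKey(long entityPrefix, String entityId) {
        byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + id.length);
        buf.putLong(entityPrefix); // compared first by the scan order
        buf.put(id);               // breaks ties among equal prefixes
        return buf.array();
    }

    // Unsigned lexicographic comparison, matching HBase row-key ordering.
    public static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }
}
```

With this layout an entity created at time 2000 sorts before one created at time 1000, regardless of how their ids compare lexicographically.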
[jira] [Created] (YARN-5379) TestHBaseTimelineStorage.testWriteApplicationToHBase() fails intermittently
Sangjin Lee created YARN-5379:
----------------------------------

             Summary: TestHBaseTimelineStorage.testWriteApplicationToHBase() fails intermittently
                 Key: YARN-5379
                 URL: https://issues.apache.org/jira/browse/YARN-5379
             Project: Hadoop YARN
          Issue Type: Bug
          Components: test, timelineserver
    Affects Versions: 3.0.0-alpha1
            Reporter: Sangjin Lee
            Priority: Minor

The {{TestHBaseTimelineStorage.testWriteApplicationToHBase()}} test seems to fail intermittently:

{noformat}
java.lang.AssertionError: null
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertTrue(Assert.java:52)
	at org.apache.hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorage.testWriteApplicationToHBase(TestHBaseTimelineStorage.java:817)
{noformat}

The stdout output:

{noformat}
2016-07-13 00:15:48,883 INFO [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:(120)) - Process identifier=hconnection-0x2b7962a2 connecting to ZooKeeper ensemble=localhost:53474
2016-07-13 00:15:48,883 INFO [main] zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=localhost:53474 sessionTimeout=9 watcher=hconnection-0x2b7962a20x0, quorum=localhost:53474, baseZNode=/hbase
2016-07-13 00:15:48,886 INFO [main-SendThread(localhost:53474)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server localhost/127.0.0.1:53474. Will not attempt to authenticate using SASL (unknown error)
2016-07-13 00:15:48,887 INFO [main-SendThread(localhost:53474)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to localhost/127.0.0.1:53474, initiating session
2016-07-13 00:15:48,887 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:53474] server.NIOServerCnxnFactory (NIOServerCnxnFactory.java:run(197)) - Accepted socket connection from /127.0.0.1:38097
2016-07-13 00:15:48,887 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:53474] server.ZooKeeperServer (ZooKeeperServer.java:processConnectRequest(868)) - Client attempting to establish new session at /127.0.0.1:38097
2016-07-13 00:15:48,896 INFO [SyncThread:0] server.ZooKeeperServer (ZooKeeperServer.java:finishSessionInit(617)) - Established session 0x155e19baa520025 with negotiated timeout 4 for client /127.0.0.1:38097
2016-07-13 00:15:48,896 INFO [main-SendThread(localhost:53474)] zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server localhost/127.0.0.1:53474, sessionid = 0x155e19baa520025, negotiated timeout = 4
2016-07-13 00:15:48,911 INFO [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:(120)) - Process identifier=hconnection-0x32130e61 connecting to ZooKeeper ensemble=localhost:53474
2016-07-13 00:15:48,912 INFO [main] zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=localhost:53474 sessionTimeout=9 watcher=hconnection-0x32130e610x0, quorum=localhost:53474, baseZNode=/hbase
2016-07-13 00:15:48,917 INFO [main-SendThread(localhost:53474)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server localhost/127.0.0.1:53474. Will not attempt to authenticate using SASL (unknown error)
2016-07-13 00:15:48,918 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:53474] server.NIOServerCnxnFactory (NIOServerCnxnFactory.java:run(197)) - Accepted socket connection from /127.0.0.1:38098
2016-07-13 00:15:48,921 INFO [main-SendThread(localhost:53474)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to localhost/127.0.0.1:53474, initiating session
2016-07-13 00:15:48,921 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:53474] server.ZooKeeperServer (ZooKeeperServer.java:processConnectRequest(868)) - Client attempting to establish new session at /127.0.0.1:38098
2016-07-13 00:15:48,929 INFO [SyncThread:0] server.ZooKeeperServer (ZooKeeperServer.java:finishSessionInit(617)) - Established session 0x155e19baa520026 with negotiated timeout 4 for client /127.0.0.1:38098
2016-07-13 00:15:48,929 INFO [main-SendThread(localhost:53474)] zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server localhost/127.0.0.1:53474, sessionid = 0x155e19baa520026, negotiated timeout = 4
2016-07-13 00:15:48,938 INFO [main] storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:serviceStop(541)) - closing the entity table
2016-07-13 00:15:48,938 INFO [main] storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:serviceStop(546)) - closing the app_flow table
2016-07-13 00:15:48,938 INFO [main] storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:serviceStop(551)) - closing the application table
2016-07-13 00:15:48,941 INFO
{noformat}
[jira] [Created] (YARN-5364) timelineservice modules have indirect dependencies on mapreduce artifacts
Sangjin Lee created YARN-5364:
----------------------------------

             Summary: timelineservice modules have indirect dependencies on mapreduce artifacts
                 Key: YARN-5364
                 URL: https://issues.apache.org/jira/browse/YARN-5364
             Project: Hadoop YARN
          Issue Type: Bug
          Components: timelineserver
    Affects Versions: 3.0.0-alpha1
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee
            Priority: Minor

The new timelineservice and timelineservice-hbase-tests modules have indirect dependencies on mapreduce artifacts through HBase and phoenix. Although it's not causing builds to fail, it's not good hygiene.
[jira] [Created] (YARN-5359) FileSystemTimelineReader/Writer uses unix-specific default
Sangjin Lee created YARN-5359:
----------------------------------

             Summary: FileSystemTimelineReader/Writer uses unix-specific default
                 Key: YARN-5359
                 URL: https://issues.apache.org/jira/browse/YARN-5359
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha1
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee

{{FileSystemTimelineReaderImpl}} and {{FileSystemTimelineWriterImpl}} use a unix-specific default. It won't work on Windows. Also, {{TestFileSystemTimelineReaderImpl}} uses this default directly, which is also brittle against concurrent tests.
[jira] [Created] (YARN-5355) YARN Timeline Service v.2: alpha 2
Sangjin Lee created YARN-5355:
----------------------------------

             Summary: YARN Timeline Service v.2: alpha 2
                 Key: YARN-5355
                 URL: https://issues.apache.org/jira/browse/YARN-5355
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: timelineserver
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee
            Priority: Critical

This is an umbrella JIRA for the alpha 2 milestone for YARN Timeline Service v.2.
[jira] [Created] (YARN-5354) TestDistributedShell.checkTimelineV2() may fail for concurrent tests
Sangjin Lee created YARN-5354:
----------------------------------

             Summary: TestDistributedShell.checkTimelineV2() may fail for concurrent tests
                 Key: YARN-5354
                 URL: https://issues.apache.org/jira/browse/YARN-5354
             Project: Hadoop YARN
          Issue Type: Bug
          Components: timelineserver
    Affects Versions: 3.0.0-alpha1
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee

{{TestDistributedShell.checkTimelineV2()}} uses the default (hard-coded) storage root directory. This is brittle against concurrent tests. We should use a unique storage directory for the unit tests.

We should also fix the default storage location for {{FileSystemTimelineWriterImpl}} to be cross-platform as part of this. The current value ({{/tmp/timeline-service-data}}) won't work on Windows.
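A sketch of the kind of fix suggested above (this assumes nothing about the actual patch; the helper name is made up): derive the storage root from the platform temp directory and make it unique per test rather than hard-coding {{/tmp/timeline-service-data}}:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Sketch only: a cross-platform, per-test storage root. The system temp
// directory resolves correctly on Windows, unlike a literal /tmp path, and
// the random suffix from createTempDirectory keeps concurrent tests from
// colliding on the same directory.
public class TimelineTestStorageRoot {

    public static File createStorageRoot(String testName) throws IOException {
        return Files.createTempDirectory("timeline-service-data-" + testName)
                    .toFile();
    }
}
```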
[jira] [Resolved] (YARN-5236) FlowRunCoprocessor brings down HBase RegionServer
     [ https://issues.apache.org/jira/browse/YARN-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee resolved YARN-5236.
-------------------------------
    Resolution: Invalid

Timeline Service v.2 documentation now states that it requires HBase 1.1.3. Closing.

> FlowRunCoprocessor brings down HBase RegionServer
> -------------------------------------------------
>
>                 Key: YARN-5236
>                 URL: https://issues.apache.org/jira/browse/YARN-5236
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineserver
>            Reporter: Haibo Chen
>
> The FlowRunCoprocessor, when loaded in HBase, will bring down the region server with exception
> java.lang.NoSuchMethodError: org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment.getRegion()
> I am running it with HBase 1.2.1 in pseudo-distributed mode to try out ATS v2
[jira] [Created] (YARN-5316) fix hadoop-aws pom not to do the exclusion
Sangjin Lee created YARN-5316:
----------------------------------

             Summary: fix hadoop-aws pom not to do the exclusion
                 Key: YARN-5316
                 URL: https://issues.apache.org/jira/browse/YARN-5316
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee

We originally introduced an exclusion rule for {{hadoop-yarn-server-tests}} in {{hadoop-aws}}, as the {{hadoop-aws}} dependency on {{joda-time}} was colliding with that coming from {{hadoop-yarn-server-timelineservice}} (via {{phoenix-core}}). Now that the phoenix dependency is no longer on {{hadoop-yarn-server-timelineservice}} itself (it's moved to {{hadoop-yarn-server-timelineservice-hbase-tests}}), it is safe to remove the exclusion rule.
[jira] [Resolved] (YARN-5252) EventDispatcher$EventProcessor.run() throws a findbugs error
     [ https://issues.apache.org/jira/browse/YARN-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee resolved YARN-5252.
-------------------------------
    Resolution: Duplicate

Thanks [~asuresh]! Hadn't noticed that one.

> EventDispatcher$EventProcessor.run() throws a findbugs error
> -------------------------------------------------------------
>
>                 Key: YARN-5252
>                 URL: https://issues.apache.org/jira/browse/YARN-5252
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Priority: Minor
>
> Findbugs complains {{EventDispatcher$EventProcessor.run()}} invokes {{System.exit()}}. This comes up every time yarn-common is touched. We should either address it or make it an exception if there is a good reason for this.
[jira] [Resolved] (YARN-5253) NodeStatusPBImpl throws a bunch of synchronization findbugs warnings
     [ https://issues.apache.org/jira/browse/YARN-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee resolved YARN-5253.
-------------------------------
    Resolution: Duplicate

Fixed by YARN-5075.

> NodeStatusPBImpl throws a bunch of synchronization findbugs warnings
> ---------------------------------------------------------------------
>
>                 Key: YARN-5253
>                 URL: https://issues.apache.org/jira/browse/YARN-5253
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Priority: Minor
>
> There are several IS2_INCONSISTENT_SYNC findbugs warnings on {{NodeStatusPBImpl}}. This should be addressed.
[jira] [Created] (YARN-5252) EventDispatcher$EventProcessor.run() throws a findbugs error
Sangjin Lee created YARN-5252:
----------------------------------

             Summary: EventDispatcher$EventProcessor.run() throws a findbugs error
                 Key: YARN-5252
                 URL: https://issues.apache.org/jira/browse/YARN-5252
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.8.0
            Reporter: Sangjin Lee
            Priority: Minor

Findbugs complains {{EventDispatcher$EventProcessor.run()}} invokes {{System.exit()}}. This comes up every time yarn-common is touched. We should either address it or make it an exception if there is a good reason for this.
[jira] [Created] (YARN-5243) fix several rebase and other miscellaneous issues before merge
Sangjin Lee created YARN-5243:
----------------------------------

             Summary: fix several rebase and other miscellaneous issues before merge
                 Key: YARN-5243
                 URL: https://issues.apache.org/jira/browse/YARN-5243
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee

I have come across a couple of miscellaneous issues while inspecting the diffs against the trunk. We also need to review one last time (probably after the final rebase) to ensure that timeline service v.2 has no impact when disabled.
[jira] [Created] (YARN-5174) add documentation on needing to add hbase-site.xml on YARN cluster
Sangjin Lee created YARN-5174:
----------------------------------

             Summary: add documentation on needing to add hbase-site.xml on YARN cluster
                 Key: YARN-5174
                 URL: https://issues.apache.org/jira/browse/YARN-5174
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee

One part that is missing in the documentation is the need to add {{hbase-site.xml}} on the client side (the client hadoop cluster). First, we need to arrive at the minimally required client settings to connect to the right hbase cluster. Then, we need to document them so that users know exactly what to do to configure the cluster to use timeline service v.2.
[jira] [Created] (YARN-5169) most of YARN events have timestamp of -1
Sangjin Lee created YARN-5169:
----------------------------------

             Summary: most of YARN events have timestamp of -1
                 Key: YARN-5169
                 URL: https://issues.apache.org/jira/browse/YARN-5169
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.7.2
            Reporter: Sangjin Lee

Most of the YARN events (subclasses of {{AbstractEvent}}) have a timestamp of -1. {{AbstractEvent}} has two constructors: one initializes the timestamp to -1 and the other to the caller-provided value. But most events use the former (thus a timestamp of -1). Some of the more common events, including {{ApplicationEvent}}, {{ContainerEvent}}, {{JobEvent}}, etc., do not set the timestamp.

The rationale for this behavior seems to be mentioned in {{AbstractEvent}}:

{code}
  // use this if you DON'T care about the timestamp
  public AbstractEvent(TYPE type) {
    this.type = type;
    // We're not generating a real timestamp here. It's too expensive.
    timestamp = -1L;
  }
{code}

This absence of the timestamp isn't really visible in many cases and therefore may have gone unnoticed, but the timeline service exposes this problem very visibly.
[jira] [Created] (YARN-5111) YARN container system metrics are not aggregated to application
Sangjin Lee created YARN-5111:
----------------------------------

             Summary: YARN container system metrics are not aggregated to application
                 Key: YARN-5111
                 URL: https://issues.apache.org/jira/browse/YARN-5111
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Priority: Critical

It appears that the container system metrics (CPU and memory) are not being aggregated onto the application. I definitely see container system metrics when I query for YARN_CONTAINER. However, there are no corresponding metrics on the parent application.
[jira] [Created] (YARN-5109) timestamps are stored unencoded causing parse errors
Sangjin Lee created YARN-5109:
----------------------------------

             Summary: timestamps are stored unencoded causing parse errors
                 Key: YARN-5109
                 URL: https://issues.apache.org/jira/browse/YARN-5109
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Priority: Blocker

When we store timestamps (for example as part of the row key or part of the column name for an event), the bytes are used as is without any encoding. If the byte value happens to contain a separator character we use (e.g. "!" or "="), it causes a parse failure when we read it.

I came across this while looking into this error in the timeline reader:

{noformat}
2016-05-17 21:28:38,643 WARN org.apache.hadoop.yarn.server.timelineservice.storage.common.TimelineStorageUtils: incorrectly formatted column name: it will be discarded
{noformat}

I traced the data that was causing this, and the column name (for the event) was the following:

{noformat}
i:e!YARN_RM_CONTAINER_CREATED=\x7F\xFF\xFE\xABDY=\x99=YARN_CONTAINER_ALLOCATED_HOST
{noformat}

Note that the column name is supposed to be of the format (event id)=(timestamp)=(event info key). However, observe the timestamp portion:

{noformat}
\x7F\xFF\xFE\xABDY=\x99
{noformat}

The presence of the separator ("=") causes the parse error.
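The collision is easy to check: the very timestamp quoted in this report, read as a big-endian long, contains the byte 0x3D ("="). A small sketch (the helper names here are illustrative, not the actual TimelineStorageUtils API):

```java
// Sketch of the bug described above, not the actual timeline service code:
// raw big-endian long bytes can contain the byte for '=', so splitting a
// column name on '=' breaks when a timestamp is embedded unencoded.
public class SeparatorCollisionDemo {

    static final byte SEPARATOR = (byte) '='; // 0x3D

    // Big-endian long encoding, as HBase's Bytes.toBytes(long) produces.
    public static byte[] toBytes(long value) {
        byte[] bytes = new byte[8];
        for (int i = 7; i >= 0; i--) {
            bytes[i] = (byte) value;
            value >>>= 8;
        }
        return bytes;
    }

    public static boolean containsSeparator(byte[] bytes) {
        for (byte b : bytes) {
            if (b == SEPARATOR) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // The timestamp bytes from the report: 7F FF FE AB 44 59 3D 99
        // ('D' = 0x44, 'Y' = 0x59, and '=' = 0x3D in the escaped dump).
        long timestamp = 0x7FFFFEAB44593D99L;
        System.out.println(containsSeparator(toBytes(timestamp))); // true
    }
}
```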
[jira] [Created] (YARN-5105) entire time series is returned for YARN container system metrics (CPU and memory)
Sangjin Lee created YARN-5105:
----------------------------------

             Summary: entire time series is returned for YARN container system metrics (CPU and memory)
                 Key: YARN-5105
                 URL: https://issues.apache.org/jira/browse/YARN-5105
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee

I see that the entire time series of the CPU and memory metrics is returned for the YARN containers REST query. This has the potential of bloating the output big time.

{noformat}
"metrics": [
  {
    "type": "TIME_SERIES",
    "id": "MEMORY",
    "values": {
      "1463518173363": 407539712,
      "1463518170347": 407539712,
{noformat}
[jira] [Created] (YARN-5102) timeline service build fails with java 8
Sangjin Lee created YARN-5102:
----------------------------------

             Summary: timeline service build fails with java 8
                 Key: YARN-5102
                 URL: https://issues.apache.org/jira/browse/YARN-5102
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee
            Priority: Blocker

The build fails with java 8:

{noformat}
[WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are:
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT
    +-jdk.tools:jdk.tools:1.8
and
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hbase:hbase-common:1.0.1
    +-org.apache.hbase:hbase-annotations:1.0.1
      +-jdk.tools:jdk.tools:1.7

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message:
Failed while enforcing releasability the error(s) are [
Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are:
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT
    +-jdk.tools:jdk.tools:1.8
and
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hbase:hbase-common:1.0.1
    +-org.apache.hbase:hbase-annotations:1.0.1
      +-jdk.tools:jdk.tools:1.7
{noformat}
[jira] [Created] (YARN-5096) timelinereader has a lot of logging that's not useful
Sangjin Lee created YARN-5096:
----------------------------------

             Summary: timelinereader has a lot of logging that's not useful
                 Key: YARN-5096
                 URL: https://issues.apache.org/jira/browse/YARN-5096
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Priority: Minor

After running about a dozen or so requests, the timelinereader log is filled with the following logging entries:

{noformat}
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
2016-05-16 15:59:13,364 INFO org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper: null prefix was specified; returning all columns
{noformat}

There were some ~3,000 such logging entries. It's too excessive.

Also, when I requested YARN_CONTAINER with fields=ALL, I see the following logs:

{noformat}
WARN org.apache.hadoop.yarn.server.timelineservice.storage.common.TimelineStorageUtils: incorrectly formatted column name: it will be discarded
{noformat}
[jira] [Created] (YARN-5097) NPE in Separator.joinEncoded()
Sangjin Lee created YARN-5097:
----------------------------------

             Summary: NPE in Separator.joinEncoded()
                 Key: YARN-5097
                 URL: https://issues.apache.org/jira/browse/YARN-5097
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: timelineserver
    Affects Versions: YARN-2928
            Reporter: Sangjin Lee
            Priority: Critical

Both in the RM log and the NM log, I see the following exception thrown. First for the RM:

{noformat}
2016-05-16 14:19:29,930 ERROR org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector: Error aggregating timeline metrics
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.timelineservice.storage.common.Separator.joinEncoded(Separator.java:249)
	at org.apache.hadoop.yarn.server.timelineservice.storage.application.ApplicationRowKey.getRowKey(ApplicationRowKey.java:110)
	at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:131)
	at org.apache.hadoop.yarn.server.timelineservice.collector.AppLevelTimelineCollector$AppLevelAggregator.run(AppLevelTimelineCollector.java:136)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
{noformat}

In the NM log, I see a similar exception:

{noformat}
2016-05-16 14:54:23,116 ERROR org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector: Error aggregating timeline metrics
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.timelineservice.storage.common.Separator.joinEncoded(Separator.java:249)
	at org.apache.hadoop.yarn.server.timelineservice.storage.application.ApplicationRowKey.getRowKey(ApplicationRowKey.java:110)
	at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:131)
	at org.apache.hadoop.yarn.server.timelineservice.collector.AppLevelTimelineCollector$AppLevelAggregator.run(AppLevelTimelineCollector.java:136)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
[jira] [Created] (YARN-5095) flow activities and flow runs are populated with wrong timestamp when RM restarts w/ recovery enabled
Sangjin Lee created YARN-5095: - Summary: flow activities and flow runs are populated with wrong timestamp when RM restarts w/ recovery enabled Key: YARN-5095 URL: https://issues.apache.org/jira/browse/YARN-5095 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Critical I have the RM recovery enabled. I see that upon restart the RM populates records into flow activity and flow runs but with *wrong* timestamps. What I mean by the timestamp is the part of the row key: - flow activity: row created with the day of the RM restart - flow run: row created with the RM start time as the "run id" The following illustrates an example flow run: {noformat} metrics: [ ], events: [ ], id: "sjlee@Sleep job/1463433569917", type: "YARN_FLOW_RUN", createdtime: 1463422860987, info: { UID: "yarn_cluster!sjlee!Sleep job!1463433569917", SYSTEM_INFO_FLOW_RUN_ID: 1463433569917, SYSTEM_INFO_FLOW_NAME: "Sleep job", SYSTEM_INFO_FLOW_RUN_END_TIME: 1463422865033, SYSTEM_INFO_USER: "sjlee" }, isrelatedto: { }, relatesto: { } {noformat} The created time and the end time are correct (i.e. original time), whereas the timestamp in the row key (= run id: 1463433569917) is actually later than the end time and coincides with the RM restart.
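The symptom above suggests the run id should come from the application's original submit time (recoverable from the RM state store) rather than from the RM's restart time. A minimal sketch of that selection logic, with hypothetical names:

```java
// Hypothetical sketch: pick the flow run id for an application. On RM
// recovery, reuse the submit time recorded in the state store; only a
// brand-new submission takes the current time.
public class FlowRunIdExample {
    public static long flowRunId(Long recoveredSubmitTime, long now) {
        return recoveredSubmitTime != null ? recoveredSubmitTime : now;
    }
}
```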
[jira] [Created] (YARN-5094) some YARN container events have timestamp of -1 in REST output
Sangjin Lee created YARN-5094: - Summary: some YARN container events have timestamp of -1 in REST output Key: YARN-5094 URL: https://issues.apache.org/jira/browse/YARN-5094 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Some events in the YARN container entities have a timestamp of -1. The RM-generated container events have proper timestamps. It appears that it's the NM-generated events that have -1: YARN_CONTAINER_CREATED, YARN_CONTAINER_FINISHED, YARN_NM_CONTAINER_LOCALIZATION_FINISHED, YARN_NM_CONTAINER_LOCALIZATION_STARTED. In the YARN container page, {noformat} { id: "YARN_CONTAINER_CREATED", timestamp: -1, info: { } }, { id: "YARN_CONTAINER_FINISHED", timestamp: -1, info: { YARN_CONTAINER_EXIT_STATUS: 0, YARN_CONTAINER_STATE: "RUNNING", YARN_CONTAINER_DIAGNOSTICS_INFO: "" } }, { id: "YARN_NM_CONTAINER_LOCALIZATION_FINISHED", timestamp: -1, info: { } }, { id: "YARN_NM_CONTAINER_LOCALIZATION_STARTED", timestamp: -1, info: { } } {noformat} I think the data itself is OK, but the values are not being populated in the REST output?
[jira] [Created] (YARN-5093) created time shows 0 in most REST output
Sangjin Lee created YARN-5093: - Summary: created time shows 0 in most REST output Key: YARN-5093 URL: https://issues.apache.org/jira/browse/YARN-5093 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Critical When querying the REST API, I find that the created time value is returned as "0" for most of the output. It includes: - flow activity and flow runs in the flow activity page - apps in the application page - entities in the entity page For example, in the flow activity page, {noformat} { metrics: [ ], events: [ ], id: "yarn_cluster/146335680/sjlee@ds-date", type: "YARN_FLOW_ACTIVITY", createdtime: 0, flowruns: [ { metrics: [ ], events: [ ], id: "sjlee@ds-date/1463435661428", type: "YARN_FLOW_RUN", createdtime: 0, info: { SYSTEM_INFO_FLOW_VERSION: "1", SYSTEM_INFO_FLOW_RUN_ID: 1463435661428, SYSTEM_INFO_FLOW_NAME: "ds-date", SYSTEM_INFO_USER: "sjlee" }, isrelatedto: { }, relatesto: { } } ], info: { SYSTEM_INFO_CLUSTER: "yarn_cluster", UID: "yarn_cluster!sjlee!ds-date", SYSTEM_INFO_FLOW_NAME: "ds-date", SYSTEM_INFO_DATE: 146335680, SYSTEM_INFO_USER: "sjlee" }, isrelatedto: { }, relatesto: { } } {noformat} The only page that appears to show the proper created time value is the flow run page. I think the data exists in the storage but is not populated in the UI.
[jira] [Created] (YARN-5071) address HBase compatibility issues with trunk
Sangjin Lee created YARN-5071: - Summary: address HBase compatibility issues with trunk Key: YARN-5071 URL: https://issues.apache.org/jira/browse/YARN-5071 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical The trunk is now adding, or planning to add, more and more backward-incompatible changes. Some examples include: - remove v.1 metrics classes (HADOOP-12504) - update the jersey version (HADOOP-9613) - target java 8 by default (HADOOP-11858) This poses big challenges for the timeline service v.2, as we have a dependency on hbase, which in turn depends on an older version of hadoop. We need to find a way to solve/contain/manage these risks before it is too late.
[jira] [Created] (YARN-5070) upgrade HBase version for first merge
Sangjin Lee created YARN-5070: - Summary: upgrade HBase version for first merge Key: YARN-5070 URL: https://issues.apache.org/jira/browse/YARN-5070 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Currently the HBase version for the timeline service storage is set to 1.0.1. This is a fairly old version, and there are good reasons to upgrade to a newer one before the first merge.
[jira] [Created] (YARN-5045) hbase unit tests fail due to dependency issues
Sangjin Lee created YARN-5045: - Summary: hbase unit tests fail due to dependency issues Key: YARN-5045 URL: https://issues.apache.org/jira/browse/YARN-5045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Blocker After the 5/4 rebase, the hbase unit tests in the timeline service project are failing: {noformat} org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage Time elapsed: 5.103 sec <<< ERROR! java.io.IOException: Shutting down at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:356) at org.apache.hadoop.hbase.http.HttpServer.addDefaultServlets(HttpServer.java:677) at org.apache.hadoop.hbase.http.HttpServer.initializeWebServer(HttpServer.java:546) at org.apache.hadoop.hbase.http.HttpServer.<init>(HttpServer.java:500) at org.apache.hadoop.hbase.http.HttpServer.<init>(HttpServer.java:104) at org.apache.hadoop.hbase.http.HttpServer$Builder.build(HttpServer.java:345) at org.apache.hadoop.hbase.http.InfoServer.<init>(InfoServer.java:77) at org.apache.hadoop.hbase.regionserver.HRegionServer.putUpWebUI(HRegionServer.java:1697) at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:550) at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:333) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at 
java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:139) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:217) at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:213) at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:93) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:978) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:938) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:812) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:806) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:750) at org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage.setup(TestTimelineReaderWebServicesHBaseStorage.java:87) {noformat} The root cause is that the hbase mini server depends on hadoop common's {{MetricsServlet}} which has been removed in the trunk (HADOOP-12504): {noformat} Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/metrics/MetricsServlet at org.apache.hadoop.hbase.http.HttpServer.addDefaultServlets(HttpServer.java:677) at org.apache.hadoop.hbase.http.HttpServer.initializeWebServer(HttpServer.java:546) at org.apache.hadoop.hbase.http.HttpServer.<init>(HttpServer.java:500) at org.apache.hadoop.hbase.http.HttpServer.<init>(HttpServer.java:104) at org.apache.hadoop.hbase.http.HttpServer$Builder.build(HttpServer.java:345) at org.apache.hadoop.hbase.http.InfoServer.<init>(InfoServer.java:77) at org.apache.hadoop.hbase.regionserver.HRegionServer.putUpWebUI(HRegionServer.java:1697) at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:550) at 
org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:333) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:139) ... 26 more {noformat}
[jira] [Resolved] (YARN-5014) Ensure non-metric values are returned as is for flow run table from the coprocessor
[ https://issues.apache.org/jira/browse/YARN-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-5014. --- Resolution: Fixed Fix Version/s: YARN-2928 This is fixed by YARN-4986. > Ensure non-metric values are returned as is for flow run table from the > coprocessor > --- > > Key: YARN-5014 > URL: https://issues.apache.org/jira/browse/YARN-5014 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Fix For: YARN-2928 > > > Presently the FlowScanner class presumes existence of NumericValueConverter > in its emitCells function. This causes an exception when we try to retrieve > non-numeric values from this table. > Exception is seen as: > {code} > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.timelineservice.storage.common.GenericConverter > cannot be cast to > org.apache.hadoop.yarn.server.timelineservice.storage.common.NumericValueConverter > at > org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowScanner.nextInternal(FlowScanner.java:246) > at > org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowScanner.nextRaw(FlowScanner.java:125) > at > org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowScanner.nextRaw(FlowScanner.java:119) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2117) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31443) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) > {code}
[jira] [Created] (YARN-4821) have a separate NM timeline publishing interval
Sangjin Lee created YARN-4821: - Summary: have a separate NM timeline publishing interval Key: YARN-4821 URL: https://issues.apache.org/jira/browse/YARN-4821 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Currently the interval with which NM publishes container CPU and memory metrics is tied to {{yarn.nodemanager.resource-monitor.interval-ms}} whose default is 3 seconds. This is too aggressive. There should be a separate configuration that controls how often {{NMTimelinePublisher}} publishes container metrics.
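As a hedged illustration of the decoupling described above, the new setting could look like the following in yarn-site.xml. The property name here is hypothetical; the actual key would be decided in the patch.

```xml
<!-- Hypothetical property name, for illustration only. Decouples NM timeline
     publishing from the 3-second resource-monitor interval. -->
<property>
  <name>yarn.nodemanager.timeline-publisher.interval-ms</name>
  <value>10000</value>
</property>
```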
[jira] [Created] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
Sangjin Lee created YARN-4761: - Summary: NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler Key: YARN-4761 URL: https://issues.apache.org/jira/browse/YARN-4761 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.4 Reporter: Sangjin Lee Assignee: Sangjin Lee YARN-3802 uncovered an issue with the scheduler where the resource calculation can be incorrect due to async event handling. It was subsequently fixed by YARN-4344, but that fix never covered the fair scheduler.
[jira] [Created] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
Sangjin Lee created YARN-4741: - Summary: RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue Key: YARN-4741 URL: https://issues.apache.org/jira/browse/YARN-4741 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sangjin Lee We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue. In our setup, we had the RM HA or stateful restart *disabled*, but NM work-preserving restart *enabled*. Due to other issues, we did a cluster-wide NM restart. Some time during the restart (which took multiple hours), we started seeing the async dispatcher event queue building. Normally it would log 1,000. In this case, it climbed all the way up to tens of millions of events. When we looked at the RM log, it was full of the following messages: {noformat} 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 {noformat} And that node in question was restarted a few minutes earlier. When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents. 
Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to help reduce the queue size, but the queue built back up to several million events and stayed there for an extended period. We had to restart the RM to resolve the problem.
[jira] [Created] (YARN-4670) add logging when a node is AM-blacklisted
Sangjin Lee created YARN-4670: - Summary: add logging when a node is AM-blacklisted Key: YARN-4670 URL: https://issues.apache.org/jira/browse/YARN-4670 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.8.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Trivial Today there is not much logging happening when a node is blacklisted for an AM (see YARN-2005). We can add a little more logging to see this activity easily from the RM logs.
[jira] [Created] (YARN-4450) TestTimelineAuthenticationFilter and TestYarnConfigurationFields fail
Sangjin Lee created YARN-4450: - Summary: TestTimelineAuthenticationFilter and TestYarnConfigurationFields fail Key: YARN-4450 URL: https://issues.apache.org/jira/browse/YARN-4450 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee When I run the unit tests against the current branch, TestTimelineAuthenticationFilter and TestYarnConfigurationFields fail: {noformat} TestTimelineAuthenticationFilter.testDelegationTokenOperations:251 » NullPointer TestTimelineAuthenticationFilter.testDelegationTokenOperations:251 » NullPointer TestYarnConfigurationFields>TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml:429 class org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in yarn-default.xml {noformat} The latter failure is caused by YARN-4356 (when we deprecated RM_SYSTEM_METRICS_PUBLISHER_ENABLED), and the former is an older issue introduced when a later use of the field {{resURI}} was added in trunk.
[jira] [Created] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
Sangjin Lee created YARN-4356: - Summary: ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off Key: YARN-4356 URL: https://issues.apache.org/jira/browse/YARN-4356 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical For us to be able to merge the first milestone drop to trunk, we want to ensure that once disabled, the timeline service v.2 has no impact from the server side to the client side. If the timeline service is not enabled, no action should be taken. If v.1 is enabled but not v.2, v.1 should behave the same as it did before the merge.
[jira] [Created] (YARN-4350) TestDistributedShell fails
Sangjin Lee created YARN-4350: - Summary: TestDistributedShell fails Key: YARN-4350 URL: https://issues.apache.org/jira/browse/YARN-4350 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Currently TestDistributedShell does not pass on the feature-YARN-2928 branch. There seem to be 2 distinct issues. (1) testDSShellWithoutDomainV2* tests fail sporadically These tests fail more often than not if run by themselves: {noformat} testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 30.998 sec <<< FAILURE! java.lang.AssertionError: Application created event should be published atleast once expected:<1> but was:<0> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207) {noformat} They started happening after YARN-4129. I suspect this might have to do with some timing issue. (2) the whole test times out If you run the whole TestDistributedShell test, it times out without fail. This may or may not have to do with the port change introduced by YARN-2859 (just a hunch).
[jira] [Created] (YARN-4284) condition for AM blacklisting is too narrow
Sangjin Lee created YARN-4284: - Summary: condition for AM blacklisting is too narrow Key: YARN-4284 URL: https://issues.apache.org/jira/browse/YARN-4284 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.8.0 Reporter: Sangjin Lee Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app attempt can be assigned to a different node. However, currently the condition under which the node gets blacklisted is limited to {{DISKS_FAILED}}. There are a whole host of other issues that may cause the failure, for which we want to locate the AM elsewhere; e.g. disks full, JVM crashes, memory issues, etc. Since the AM blacklisting is per-app, there is little practical downside in blacklisting the nodes on *any failure* (although it might lead to blacklisting the node more aggressively than necessary). I would propose locating the next app attempt to a different node on any failure.
[jira] [Created] (YARN-4261) fix the order of timelinereader in yarn/yarn.cmd
Sangjin Lee created YARN-4261: - Summary: fix the order of timelinereader in yarn/yarn.cmd Key: YARN-4261 URL: https://issues.apache.org/jira/browse/YARN-4261 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Trivial The order of the timelinereader command is not correct in yarn/yarn.cmd.
[jira] [Resolved] (YARN-4174) Fix javadoc warnings floating up from hbase
[ https://issues.apache.org/jira/browse/YARN-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-4174. --- Resolution: Done Fix Version/s: YARN-2928 This ended up getting fixed as part of YARN-3901. > Fix javadoc warnings floating up from hbase > > > Key: YARN-4174 > URL: https://issues.apache.org/jira/browse/YARN-4174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vrushali C >Assignee: Sangjin Lee >Priority: Minor > Fix For: YARN-2928 > > > As part of the patch for YARN-3901, [~sjlee0] observed some (~200) javadoc > warnings that are coming from hbase classes. > We tried a bunch of things like making the FlowRunCoprocessor class non > public and excluding the package from the pom. If the class is made non > public, the table creation has an exception. > {code} > 206 warnings > [WARNING] Javadoc Warnings > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestWALObserver.class): > warning: Cannot find annotation method 'value()' in type 'Category': class > file for org.junit.experimental.categories.Category not found > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRowProcessorEndpoint.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerObserver.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithRemove.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > 
/Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithRemove.class): > warning: Cannot find annotation method 'timeout()' in type 'Test': class > file for org.junit.Test not found > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithAbort.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithAbort.class): > warning: Cannot find annotation method 'timeout()' in type 'Test' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithAbort.class): > warning: Cannot find annotation method 'timeout()' in type 'Test' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionObserverStacking.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionObserverScannerOpenHook.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > 
/Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionObserverInterface.class): > warning: Cannot find annotation method 'value()' in type 'Category' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionObserverInterface.class): > warning: Cannot find annotation method 'timeout()' in type 'Test' > [WARNING] > /Users/username/.m2/repository/org/apache/hbase/hbase-server/1.0.1/hbase-server-1.0.1-tests.jar(org/apache/hadoop/hbase/coprocessor/TestRegionObserverInterface.class): > warning: Cannot find annotation method 'timeout()' in type 'Test' > [WARNING] >
[jira] [Created] (YARN-4179) [reader implementation] support flow activity queries based on time
Sangjin Lee created YARN-4179: - Summary: [reader implementation] support flow activity queries based on time Key: YARN-4179 URL: https://issues.apache.org/jira/browse/YARN-4179 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Minor This came up as part of YARN-4074 and YARN-4075. Currently the only query pattern that's supported on the flow activity table is by cluster only. But it might be useful to support queries by cluster and a certain date or date range.
[jira] [Created] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering
Sangjin Lee created YARN-4178: - Summary: [storage implementation] app id as string can cause incorrect ordering Key: YARN-4178 URL: https://issues.apache.org/jira/browse/YARN-4178 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Currently the app id is used in various places as part of row keys and in column names. However, they are treated as strings for the most part. This will cause a problem with ordering when the id portion of the app id rolls over to the next digit. For example, "app_1234567890_100" will be considered *earlier* than "app_1234567890_99". We should correct this.
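The ordering bug above is easy to reproduce with plain string comparison, and comparing the numeric components of the app id fixes it. A small sketch (hypothetical helper, not YARN's actual ApplicationId API):

```java
// Hypothetical sketch: compare app ids of the form
// "app_<clusterTimestamp>_<sequenceNumber>" by their numeric parts instead of
// lexicographically, so "app_..._100" sorts after "app_..._99".
public class AppIdOrdering {
    public static int compareNumeric(String appId1, String appId2) {
        String[] a = appId1.split("_");
        String[] b = appId2.split("_");
        // Compare cluster timestamps first, then sequence numbers.
        int byClusterTs =
            Long.compare(Long.parseLong(a[1]), Long.parseLong(b[1]));
        if (byClusterTs != 0) {
            return byClusterTs;
        }
        return Long.compare(Long.parseLong(a[2]), Long.parseLong(b[2]));
    }
}
```

A fixed-width binary encoding of the two numbers would achieve the same ordering directly in HBase row keys.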
[jira] [Created] (YARN-4116) refactor ColumnHelper read* methods
Sangjin Lee created YARN-4116: - Summary: refactor ColumnHelper read* methods Key: YARN-4116 URL: https://issues.apache.org/jira/browse/YARN-4116 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Currently we have several ColumnHelper.read* methods that differ slightly in their initial conditions and behave differently as a result. We may want to refactor them to maximize code reuse while keeping the API reasonable.
[jira] [Created] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
Sangjin Lee created YARN-4074: - Summary: [timeline reader] implement support for querying for flows and flow runs Key: YARN-4074 URL: https://issues.apache.org/jira/browse/YARN-4074 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Implement support for querying for flows and flow runs. We should be able to query for the most recent N flows, etc. This includes changes to the {{TimelineReader}} API if necessary, as well as implementation of the API.
[jira] [Created] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
Sangjin Lee created YARN-4075: - Summary: [reader REST API] implement support for querying for flows and flow runs Key: YARN-4075 URL: https://issues.apache.org/jira/browse/YARN-4075 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee We need to be able to query for flows and flow runs via REST.
[jira] [Created] (YARN-4064) build is broken at TestHBaseTimelineWriterImpl.java
Sangjin Lee created YARN-4064: - Summary: build is broken at TestHBaseTimelineWriterImpl.java Key: YARN-4064 URL: https://issues.apache.org/jira/browse/YARN-4064 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Blocker When YARN-4025 was committed, somehow the file rename from {{TestHBaseTimelineWriterImpl.java}} to {{TestHBaseTimelineStorage.java}} didn't happen as in the patch. As a result, the build is broken.
[jira] [Created] (YARN-3981) support timeline clients not associated with an application
Sangjin Lee created YARN-3981: - Summary: support timeline clients not associated with an application Key: YARN-3981 URL: https://issues.apache.org/jira/browse/YARN-3981 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee In the current v.2 design, all timeline writes must belong in a flow/application context (cluster + user + flow + flow run + application). But there are use cases that require writing data outside the context of an application. One such example is a higher level client (e.g. tez client or hive/oozie/cascading client) writing flow-level data that spans multiple applications. We need to find a way to support them.
[jira] [Created] (YARN-3949) ensure timely flush of timeline writes
Sangjin Lee created YARN-3949: - Summary: ensure timely flush of timeline writes Key: YARN-3949 URL: https://issues.apache.org/jira/browse/YARN-3949 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Currently flushing of timeline writes is not really handled. For example, {{HBaseTimelineWriterImpl}} relies on HBase's {{BufferedMutator}} to batch and write puts asynchronously. However, {{BufferedMutator}} may not flush them to HBase unless the internal buffer fills up. We do need a flush functionality first to ensure that data are written in a reasonably timely manner, and to be able to ensure some critical writes are done synchronously (e.g. key lifecycle events). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
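The buffering-versus-timeliness tradeoff described above is generic. A toy pure-Java sketch (illustrative only, not the HBase {{BufferedMutator}} API) of a writer that batches writes but exposes an explicit flush, so callers can bound write latency for critical events:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative toy buffered writer: records are batched in memory and reach
// the backing store only when the buffer fills or flush() is called explicitly.
// Without the explicit flush(), a partially filled buffer could sit unwritten
// indefinitely -- the problem described in this issue.
public class ToyBufferedWriter {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> storage = new ArrayList<>(); // stands in for the backing store
    private final int capacity;

    public ToyBufferedWriter(int capacity) {
        this.capacity = capacity;
    }

    public void write(String record) {
        buffer.add(record);
        if (buffer.size() >= capacity) {
            flush(); // automatic flush happens only when the buffer fills up
        }
    }

    public void flush() {
        storage.addAll(buffer);
        buffer.clear();
    }

    public int flushedCount() {
        return storage.size();
    }
}
```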
[jira] [Created] (YARN-3906) split the application table from the entity table
Sangjin Lee created YARN-3906: - Summary: split the application table from the entity table Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3907) create the flow-version table
Sangjin Lee created YARN-3907: - Summary: create the flow-version table Key: YARN-3907 URL: https://issues.apache.org/jira/browse/YARN-3907 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Per discussions on YARN-3815, create the flow-version table that maps flow versions with various data about the versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3836) add equals and hashCode to TimelineEntity and other classes in the data model
Sangjin Lee created YARN-3836: - Summary: add equals and hashCode to TimelineEntity and other classes in the data model Key: YARN-3836 URL: https://issues.apache.org/jira/browse/YARN-3836 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Classes in the data model API (e.g. {{TimelineEntity}}, {{TimelineEntity.Identifier}}, etc.) do not override {{equals()}} or {{hashCode()}}. This can cause problems when these objects are used in a collection such as a {{HashSet}}. We should implement these methods wherever appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
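As an illustration of the kind of change proposed (the identifier class below is a hypothetical stand-in, not the actual {{TimelineEntity.Identifier}} code), overriding both methods consistently is what makes instances safe as {{HashSet}}/{{HashMap}} keys:

```java
import java.util.Objects;

// Hypothetical identifier class showing the equals/hashCode contract needed
// before instances can safely be used in hash-based collections: two equal
// objects must produce the same hash code.
public class EntityIdentifier {
    private final String type;
    private final String id;

    public EntityIdentifier(String type, String id) {
        this.type = type;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EntityIdentifier)) return false;
        EntityIdentifier other = (EntityIdentifier) o;
        return Objects.equals(type, other.type) && Objects.equals(id, other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, id);
    }
}
```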
[jira] [Created] (YARN-3741) consider nulling member maps/sets of TimelineEntity
Sangjin Lee created YARN-3741: - Summary: consider nulling member maps/sets of TimelineEntity Key: YARN-3741 URL: https://issues.apache.org/jira/browse/YARN-3741 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Currently there are multiple collection members of TimelineEntity that are always instantiated, regardless of whether they are used or not: info, configs, metrics, events, isRelatedToEntities, and relatesToEntities. Since TimelineEntities will be created very often and in lots of cases many of these members will be empty, creating these empty collections is wasteful in terms of garbage collector pressure. It would be good to start out with null members, and instantiate these collections only if they are actually used. Of course, we need to make that contract very clear and refactor all client code to handle that scenario. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
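The lazy-instantiation pattern being proposed can be sketched as follows (a minimal illustration with hypothetical names, not the actual TimelineEntity code):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy instantiation: the map is created only on the first write,
// so entities that never carry info incur no allocation for it, reducing
// garbage-collector pressure when many entities are created.
public class LazyEntity {
    private Map<String, Object> info; // starts out null by design

    public void addInfo(String key, Object value) {
        if (info == null) {
            info = new HashMap<>();
        }
        info.put(key, value);
    }

    // Per the contract change, readers must tolerate the null case;
    // returning an empty view is one way to hide it from callers.
    public Map<String, Object> getInfo() {
        return info == null ? Collections.emptyMap() : info;
    }
}
```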
[jira] [Created] (YARN-3721) build is broken on YARN-2928 branch due to possible dependency cycle
Sangjin Lee created YARN-3721: - Summary: build is broken on YARN-2928 branch due to possible dependency cycle Key: YARN-3721 URL: https://issues.apache.org/jira/browse/YARN-3721 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Blocker The build is broken on the YARN-2928 branch at the hadoop-yarn-server-timelineservice module. It's been broken for a while, but we didn't notice it because the build happens to work despite this if the maven local cache is not cleared. To reproduce, remove all hadoop (3.0.0-SNAPSHOT) artifacts from your maven local cache and build it. Almost certainly it was introduced by YARN-3529. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3634) TestMRTimelineEventHandling is broken due to timing issues
Sangjin Lee created YARN-3634: - Summary: TestMRTimelineEventHandling is broken due to timing issues Key: YARN-3634 URL: https://issues.apache.org/jira/browse/YARN-3634 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee TestMRTimelineEventHandling is broken. Relevant error message:
{noformat}
2015-05-12 06:28:56,415 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:28:57,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:28:58,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:28:59,417 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:00,418 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:01,419 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:02,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:03,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:04,421 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:05,422 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-05-12 06:29:05,424 ERROR [AsyncDispatcher event handler] collector.NodeTimelineCollectorManager (NodeTimelineCollectorManager.java:postPut(121)) - Failed to communicate with NM Collector Service for application_1431412130291_0001
2015-05-12 06:29:05,425 WARN [AsyncDispatcher event handler] containermanager.AuxServices (AuxServices.java:logWarningWhenAuxServiceThrowExceptions(261)) - The auxService name is timeline_collector and it got an error at event: CONTAINER_INIT
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 to asf904.gq1.ygridcore.net:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager.putIfAbsent(TimelineCollectorManager.java:97)
at org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService.addApplication(PerNodeTimelineCollectorsAuxService.java:99)
at org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService.initializeContainer(PerNodeTimelineCollectorsAuxService.java:126)
[jira] [Created] (YARN-3616) determine how to generate YARN container events
Sangjin Lee created YARN-3616: - Summary: determine how to generate YARN container events Key: YARN-3616 URL: https://issues.apache.org/jira/browse/YARN-3616 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee The initial design called for the node manager to write YARN container events to take advantage of the distributed writes. RM acting as a sole writer of all YARN container events would have significant scalability problems. Still, there are some types of events that are not captured by the NM. The current implementation has both: RM writing container events and NM writing container events. We need to sort this out, and decide how we can write all needed container events in a scalable manner. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3562) unit tests fail with the failure to bring up node manager
Sangjin Lee created YARN-3562: - Summary: unit tests fail with the failure to bring up node manager Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Priority: Minor A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
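The usual fix for fixed-port collisions like the NMCollectorService one above is to bind port 0 in tests and let the OS pick a free ephemeral port; a minimal sketch (hypothetical helper, not the actual YARN fix):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

// Binding port 0 asks the OS for any free ephemeral port, avoiding the
// collisions that a hard-coded port (like 8048 above) causes when multiple
// mini-cluster daemons start on the same host.
public class EphemeralPort {
    public static int pickFreePort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort(); // the actual port the OS assigned
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a mini-cluster test, each daemon instance would be configured with such a port instead of a shared constant.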
[jira] [Resolved] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-3390. --- Resolution: Fixed Fix Version/s: YARN-2928 Committed. Thanks much [~zjshen] and [~Naganarasimha] for working on the patch, and [~gtCarrera9] for your review! Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: YARN-2928 Attachments: YARN-3390.1.patch, YARN-3390.2.patch, YARN-3390.3.patch, YARN-3390.4.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3512) add more fine-grained metrics that measure write performance
Sangjin Lee created YARN-3512: - Summary: add more fine-grained metrics that measure write performance Key: YARN-3512 URL: https://issues.apache.org/jira/browse/YARN-3512 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee We need more fine-grained metrics in the load testing tool that measure the write performance of the timeline service. Currently it only captures the number of writes and bytes per sec from the API point of view. But the actual storage implementation may turn them into many more/fewer writes to the storage itself. We need more fine-grained data about what's going on in terms of actual writes to storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3437) convert load test driver to timeline service v.2
Sangjin Lee created YARN-3437: - Summary: convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3438) add a mode to replay MR job history files to the timeline service
Sangjin Lee created YARN-3438: - Summary: add a mode to replay MR job history files to the timeline service Key: YARN-3438 URL: https://issues.apache.org/jira/browse/YARN-3438 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee The subtask covers the work on top of YARN-3437 to add a mode to replay MR job history files to the timeline service storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
Sangjin Lee created YARN-3411: - Summary: [Storage implementation] explore the native HBase write schema for storage Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Priority: Critical There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3401) [Data Model] users should not be able to create a generic TimelineEntity and associate arbitrary type
Sangjin Lee created YARN-3401: - Summary: [Data Model] users should not be able to create a generic TimelineEntity and associate arbitrary type Key: YARN-3401 URL: https://issues.apache.org/jira/browse/YARN-3401 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee IIUC it is possible for users to create a generic TimelineEntity and set an arbitrary entity type. For example, for a YARN app, the right entity API is ApplicationEntity. However, today nothing stops users from instantiating a base TimelineEntity class and set the application type on it. This presents a problem in handling these YARN system entities in the storage layer for example. We need to ensure that the API allows only the right type of the class to be created for a given entity type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
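One way to enforce this constraint (an illustrative sketch with hypothetical names and type strings, not the actual YARN API) is to validate the entity type at the point where it is set:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: reject attempts to label a generic entity with a
// type string that is reserved for a dedicated subclass such as an
// application entity.
public class EntityTypeGuard {
    // Types that must only come from their dedicated entity classes.
    private static final Set<String> RESERVED =
        new HashSet<>(Arrays.asList("YARN_APPLICATION", "YARN_FLOW_RUN"));

    public static String checkType(String type, boolean isDedicatedSubclass) {
        if (RESERVED.contains(type) && !isDedicatedSubclass) {
            throw new IllegalArgumentException(
                "Type " + type + " may only be set by its dedicated entity class");
        }
        return type;
    }
}
```

A base-class setter would call such a check, while the dedicated subclasses bypass it; the storage layer can then trust that a reserved type implies the right class.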
[jira] [Created] (YARN-3377) TestTimelineServiceClientIntegration fails
Sangjin Lee created YARN-3377: - Summary: TestTimelineServiceClientIntegration fails Key: YARN-3377 URL: https://issues.apache.org/jira/browse/YARN-3377 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Priority: Minor TestTimelineServiceClientIntegration fails. It appears we are getting 500 from the timeline collector. This appears to be mostly an issue with the test itself.
{noformat}
--- Test set: org.apache.hadoop.yarn.server.timelineservice.TestTimelineServiceClientIntegration ---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 33.503 sec FAILURE! - in org.apache.hadoop.yarn.server.timelineservice.TestTimelineServiceClientIntegration
testPutEntities(org.apache.hadoop.yarn.server.timelineservice.TestTimelineServiceClientIntegration) Time elapsed: 32.606 sec ERROR!
org.apache.hadoop.yarn.exceptions.YarnException: Failed to get the response from the timeline server.
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putObjects(TimelineClientImpl.java:457)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putObjects(TimelineClientImpl.java:391)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:368)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:342)
at org.apache.hadoop.yarn.server.timelineservice.TestTimelineServiceClientIntegration.testPutEntities(TestTimelineServiceClientIntegration.java:74)
{noformat}
The relevant piece from the server side:
{noformat}
Mar 19, 2015 10:48:30 AM com.sun.jersey.api.core.PackagesResourceConfig init
INFO: Scanning for root resource and provider classes in the packages: org.apache.hadoop.yarn.server.timelineservice.collector org.apache.hadoop.yarn.webapp org.apache.hadoop.yarn.webapp
Mar 19, 2015 10:48:30 AM com.sun.jersey.api.core.ScanningResourceConfig logClasses
INFO: Root resource classes found: class org.apache.hadoop.yarn.webapp.MyTestWebService class org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService
Mar 19, 2015 10:48:30 AM com.sun.jersey.api.core.ScanningResourceConfig logClasses
INFO: Provider classes found: class org.apache.hadoop.yarn.webapp.YarnJacksonJaxbJsonProvider class org.apache.hadoop.yarn.webapp.GenericExceptionHandler class org.apache.hadoop.yarn.webapp.MyTestJAXBContextResolver
Mar 19, 2015 10:48:30 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Mar 19, 2015 10:48:31 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator$8 resolve
SEVERE: null
java.lang.IllegalAccessException: Class com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator$8 can not access a member of class org.apache.hadoop.yarn.webapp.MyTestWebService$MyInfo with modifiers public
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:95)
at java.lang.Class.newInstance0(Class.java:366)
at java.lang.Class.newInstance(Class.java:325)
at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator$8.resolve(WadlGeneratorJAXBGrammarGenerator.java:467)
at com.sun.jersey.server.wadl.WadlGenerator$ExternalGrammarDefinition.resolve(WadlGenerator.java:181)
at com.sun.jersey.server.wadl.ApplicationDescription.resolve(ApplicationDescription.java:81)
at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.attachTypes(WadlGeneratorJAXBGrammarGenerator.java:518)
at com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:124)
at com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104)
at com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120)
at com.sun.jersey.server.impl.wadl.WadlMethodFactory$WadlOptionsMethodDispatcher.dispatch(WadlMethodFactory.java:98)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at
[jira] [Created] (YARN-3378) a load test client that can replay a volume of history files
Sangjin Lee created YARN-3378: - Summary: a load test client that can replay a volume of history files Key: YARN-3378 URL: https://issues.apache.org/jira/browse/YARN-3378 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee It might be good to create a load test client that can replay a large volume of history files into the timeline service. One can envision running such a load test client as a mapreduce job and generating a fair amount of load. It would be useful to spot check correctness, and more importantly to observe performance characteristics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3353) provide RPC metrics via JMX for timeline collectors and readers
Sangjin Lee created YARN-3353: - Summary: provide RPC metrics via JMX for timeline collectors and readers Key: YARN-3353 URL: https://issues.apache.org/jira/browse/YARN-3353 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee We should provide RPC metrics via JMX for timeline collectors and readers. One challenge we may have is it might be difficult to provide a stable view for the metrics, given the distributed nature of the collectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3333) rename TimelineAggregator etc. to TimelineCollector
Sangjin Lee created YARN-3333: - Summary: rename TimelineAggregator etc. to TimelineCollector Key: YARN-3333 URL: https://issues.apache.org/jira/browse/YARN-3333 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Per discussions on YARN-2928, let's rename TimelineAggregator, etc. to TimelineCollector, etc. There are also several minor issues on the current branch, which can be fixed as part of this:
- fixing some imports
- missing license in TestTimelineServerClientIntegration.java
- whitespaces
- missing direct dependency
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3167) implement the core functionality of the base aggregator service
Sangjin Lee created YARN-3167: - Summary: implement the core functionality of the base aggregator service Key: YARN-3167 URL: https://issues.apache.org/jira/browse/YARN-3167 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee The basic skeleton of the timeline aggregator has been set up by YARN-3030. We need to implement the core functionality of the base aggregator service. The key things include - handling the requests from clients (sync or async) - buffering data - handling the aggregation logic - invoking the storage API -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3037) create HBase cluster backing storage implementation for ATS writes
Sangjin Lee created YARN-3037: - Summary: create HBase cluster backing storage implementation for ATS writes Key: YARN-3037 URL: https://issues.apache.org/jira/browse/YARN-3037 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create a backing storage implementation for ATS writes based on a full HBase cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3041) create the ATS entity/event API
Sangjin Lee created YARN-3041: - Summary: create the ATS entity/event API Key: YARN-3041 URL: https://issues.apache.org/jira/browse/YARN-3041 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create the ATS entity and events API. Also, as part of this JIRA, create YARN system entities (e.g. cluster, user, flow, flow run, YARN app, ...). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3052) provide a very simple POC html ATS UI
Sangjin Lee created YARN-3052: - Summary: provide a very simple POC html ATS UI Key: YARN-3052 URL: https://issues.apache.org/jira/browse/YARN-3052 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee As part of accomplishing a minimum viable product, we want to be able to show some UI in html (however crude it is). This subtask calls for creating a barebones UI to do that. This should be replaced later with a better-designed and implemented proper UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3053) review and implement proper security in ATS v.2
Sangjin Lee created YARN-3053: - Summary: review and implement proper security in ATS v.2 Key: YARN-3053 URL: https://issues.apache.org/jira/browse/YARN-3053 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, we want to evaluate and review the system for security, and ensure proper security in the system. This includes proper authentication, token management, access control, and any other relevant security aspects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3030) set up ATS writer with basic request serving structure and lifecycle
Sangjin Lee created YARN-3030: - Summary: set up ATS writer with basic request serving structure and lifecycle Key: YARN-3030 URL: https://issues.apache.org/jira/browse/YARN-3030 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create an ATS writer as a service, and implement the basic service structure including the lifecycle management. Also, as part of this JIRA, we should come up with the ATS client API for sending requests to this ATS writer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3036) create standalone HBase backing storage implementation for ATS writes
Sangjin Lee created YARN-3036: - Summary: create standalone HBase backing storage implementation for ATS writes Key: YARN-3036 URL: https://issues.apache.org/jira/browse/YARN-3036 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create a (default) standalone HBase backing storage implementation for ATS writes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3042) create ATS metrics API
Sangjin Lee created YARN-3042: - Summary: create ATS metrics API Key: YARN-3042 URL: https://issues.apache.org/jira/browse/YARN-3042 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create the ATS metrics API and integrate it into the entities. The concept may be based on the existing hadoop metrics, but we want to make sure we have something that would satisfy all ATS use cases. It also needs to capture whether a metric should be aggregated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3038) handle ATS writer failure scenarios
Sangjin Lee created YARN-3038: - Summary: handle ATS writer failure scenarios Key: YARN-3038 URL: https://issues.apache.org/jira/browse/YARN-3038 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, consider various ATS writer failure scenarios, and implement proper handling. For example, ATS writers may fail and exit due to OOM. It should be retried a certain number of times in that case. We also need to tie fatal ATS writer failures (after exhausting all retries) to the application failure, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3047) set up ATS reader with basic request serving structure and lifecycle
Sangjin Lee created YARN-3047: - Summary: set up ATS reader with basic request serving structure and lifecycle Key: YARN-3047 URL: https://issues.apache.org/jira/browse/YARN-3047 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, set up the ATS reader as a service and implement its basic structure, including lifecycle management, request serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3032) implement ATS writer functionality to serve ATS readers' requests for live apps
Sangjin Lee created YARN-3032: - Summary: implement ATS writer functionality to serve ATS readers' requests for live apps Key: YARN-3032 URL: https://issues.apache.org/jira/browse/YARN-3032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement the functionality in ATS writer to serve data for live apps coming from ATS readers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3039) implement ATS writer service discovery
Sangjin Lee created YARN-3039: - Summary: implement ATS writer service discovery Key: YARN-3039 URL: https://issues.apache.org/jira/browse/YARN-3039 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement ATS writer service discovery. This is essential for off-node clients to send writes to the right ATS writer. This should also handle the case of AM failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3051) create backing storage read interface for ATS readers
Sangjin Lee created YARN-3051: - Summary: create backing storage read interface for ATS readers Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3033) implement NM starting the ATS writer companion
Sangjin Lee created YARN-3033: - Summary: implement NM starting the ATS writer companion Key: YARN-3033 URL: https://issues.apache.org/jira/browse/YARN-3033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement node managers starting the ATS writer companion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3045) implement NM writing container lifecycle events and container system metrics to ATS
Sangjin Lee created YARN-3045: - Summary: implement NM writing container lifecycle events and container system metrics to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3044) implement RM writing app lifecycle events to ATS
Sangjin Lee created YARN-3044: - Summary: implement RM writing app lifecycle events to ATS Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3034) implement RM starting its ATS writer
Sangjin Lee created YARN-3034: - Summary: implement RM starting its ATS writer Key: YARN-3034 URL: https://issues.apache.org/jira/browse/YARN-3034 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement resource managers starting their own ATS writers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3031) create backing storage write interface for ATS writers
Sangjin Lee created YARN-3031: - Summary: create backing storage write interface for ATS writers Key: YARN-3031 URL: https://issues.apache.org/jira/browse/YARN-3031 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, come up with the interface for the ATS writer to write to various backing storages. The interface should be created to capture the right level of abstractions so that it will enable all backing storage implementations to implement it efficiently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3040) implement client-side API for handling flows
Sangjin Lee created YARN-3040: - Summary: implement client-side API for handling flows Key: YARN-3040 URL: https://issues.apache.org/jira/browse/YARN-3040 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, implement a client-side API for handling *flows*. Frameworks should be able to define and pass in all attributes of flows and flow runs to YARN, and these should be passed on to the ATS writers. YARN tags were discussed as a way to carry this piece of information.
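To make the tag idea concrete, here is a minimal sketch of how a flow attribute could be encoded in and recovered from an application tag. The tag prefix and helper names are assumptions for illustration, not a committed convention.

```java
/**
 * Illustrative only: one way flow attributes could be carried as YARN
 * application tags, as discussed above. The tag prefix and class name
 * are assumptions, not a committed convention.
 */
final class FlowTags {

  private static final String FLOW_NAME_PREFIX = "TIMELINE_FLOW_NAME_TAG:";

  private FlowTags() { }

  /** Formats a flow name as an application tag string. */
  static String flowNameTag(String flowName) {
    return FLOW_NAME_PREFIX + flowName;
  }

  /** Extracts the flow name from a tag, or returns null for other tags. */
  static String parseFlowName(String tag) {
    return tag.startsWith(FLOW_NAME_PREFIX)
        ? tag.substring(FLOW_NAME_PREFIX.length()) : null;
  }
}
```

A framework would attach such tags at submission time, and the ATS writers would parse them back out on the server side; prefixed tags keep flow attributes distinguishable from ordinary user tags.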
[jira] [Created] (YARN-3046) implement MapReduce AM writing some MR metrics to ATS
Sangjin Lee created YARN-3046: - Summary: implement MapReduce AM writing some MR metrics to ATS Key: YARN-3046 URL: https://issues.apache.org/jira/browse/YARN-3046 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes written) and have the MR AM write the framework-specific metrics to ATS.
[jira] [Created] (YARN-3035) create a test-only backing storage implementation for ATS writes
Sangjin Lee created YARN-3035: - Summary: create a test-only backing storage implementation for ATS writes Key: YARN-3035 URL: https://issues.apache.org/jira/browse/YARN-3035 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create a test-only, bare-bones backing storage implementation for ATS writes. We could consider something like a no-op or in-memory storage strictly for development and testing purposes.
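An in-memory store of the kind suggested above could be as small as the following sketch; the class and method names are illustrative assumptions, not the implementation this JIRA will produce.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical in-memory backing storage for ATS writes, strictly for
 * development and testing. Entities are kept in a synchronized list and
 * never persisted; class and method names are illustrative assumptions.
 */
class InMemoryTimelineStore {

  private final List<String> entities =
      Collections.synchronizedList(new ArrayList<String>());

  /** Records a serialized timeline entity; a real store would index it. */
  void write(String entity) {
    entities.add(entity);
  }

  /** Returns the number of entities written so far. */
  int size() {
    return entities.size();
  }

  /** Returns a snapshot of all written entities, for test assertions. */
  List<String> dump() {
    synchronized (entities) {
      return new ArrayList<String>(entities);
    }
  }
}
```

The value of such a store is that tests can assert directly on what was written ({{dump()}}) without standing up a real storage backend.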
[jira] [Created] (YARN-3043) create ATS configuration, metadata, etc. as part of entities
Sangjin Lee created YARN-3043: - Summary: create ATS configuration, metadata, etc. as part of entities Key: YARN-3043 URL: https://issues.apache.org/jira/browse/YARN-3043 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, create APIs for configuration, metadata, etc. and integrate them into entities.
[jira] [Created] (YARN-3049) implement existing ATS queries in the new ATS design
Sangjin Lee created YARN-3049: - Summary: implement existing ATS queries in the new ATS design Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Implement existing ATS queries with the new ATS reader design.
[jira] [Created] (YARN-3048) handle how to set up and start/stop ATS reader instances
Sangjin Lee created YARN-3048: - Summary: handle how to set up and start/stop ATS reader instances Key: YARN-3048 URL: https://issues.apache.org/jira/browse/YARN-3048 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee Per design in YARN-2928, come up with a way to set up and start/stop ATS reader instances. This should allow setting up multiple instances and managing user traffic to those instances.
[jira] [Created] (YARN-3007) TestNMWebServices#testContainerLogs fails intermittently
Sangjin Lee created YARN-3007: - Summary: TestNMWebServices#testContainerLogs fails intermittently Key: YARN-3007 URL: https://issues.apache.org/jira/browse/YARN-3007 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Minor TestNMWebServices#testContainerLogs fails intermittently with JDK 7: {noformat} java.lang.AssertionError: Failed to create log dir at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices.testContainerLogs(TestNMWebServices.java:336) {noformat}
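The failing assertion corresponds to a {{mkdir}}-style call in the test setup. One way tests commonly avoid this class of flakiness is to tolerate a directory that already exists and to fail with a message saying which path could not be created; a minimal sketch of that pattern follows (the {{LogDirs}} helper is hypothetical, not code from the test).

```java
import java.io.File;
import java.io.IOException;

/**
 * Illustrative helper for flaky dir-creation in tests: mkdirs() returning
 * false is only a failure if the directory does not already exist.
 * The class is hypothetical, not part of TestNMWebServices.
 */
final class LogDirs {

  private LogDirs() { }

  /** Creates the directory if needed; idempotent across repeated calls. */
  static File ensureDir(File dir) throws IOException {
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Failed to create log dir " + dir);
    }
    return dir;
  }
}
```

Because {{ensureDir}} is idempotent, a leftover directory from a previous intermittent run no longer fails the test.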
[jira] [Resolved] (YARN-3007) TestNMWebServices#testContainerLogs fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-3007. --- Resolution: Invalid This issue is not reproducible in 2.7.0 or in trunk. Closing. TestNMWebServices#testContainerLogs fails intermittently Key: YARN-3007 URL: https://issues.apache.org/jira/browse/YARN-3007 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Minor
[jira] [Created] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
Sangjin Lee created YARN-2928: - Summary: Application Timeline Server (ATS) next gen: phase 1 Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee We have the application timeline server implemented in YARN per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and missing features that need to be addressed. This JIRA proposes the design and implementation changes to address them. This is phase 1 of the effort.
[jira] [Created] (YARN-2774) shared cache uploader service should authorize notify calls properly
Sangjin Lee created YARN-2774: - Summary: shared cache uploader service should authorize notify calls properly Key: YARN-2774 URL: https://issues.apache.org/jira/browse/YARN-2774 Project: Hadoop YARN Issue Type: Task Reporter: Sangjin Lee The shared cache manager (SCM) uploader service (done in YARN-2186) currently does not authorize the calls that notify the SCM of a newly uploaded resource. Proper security/authorization needs to be added to this RPC call.
[jira] [Created] (YARN-2600) if the container is killed during localization outstanding public cache localization tasks should be cancelled
Sangjin Lee created YARN-2600: - Summary: if the container is killed during localization outstanding public cache localization tasks should be cancelled Key: YARN-2600 URL: https://issues.apache.org/jira/browse/YARN-2600 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Sangjin Lee We came across a situation (partly related to HDFS-7005) where a large number of public cache localization tasks were queued in the public localizer thread pool, but the container was killed during localization (as it went over the timeout). What makes this worse is that any queued work item will still be serviced by the resource localization service, which is wasteful and may further delay localization for other containers. It would be good to cancel the pending localization tasks when the container is killed.
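The cancellation being proposed could follow the standard pattern of retaining the {{Future}} of each queued task and cancelling the not-yet-started ones when the container is killed. The sketch below uses a plain {{ExecutorService}} in place of the real public localizer; the class and method names are assumptions for illustration only.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Illustrative sketch only: tracks the Future of each queued download
 * per container so not-yet-started tasks can be cancelled when the
 * container is killed. A plain ExecutorService stands in for the real
 * public localizer thread pool; all names are assumptions.
 */
class LocalizationTracker {

  private final ExecutorService pool;
  private final ConcurrentMap<String, List<Future<?>>> pending =
      new ConcurrentHashMap<>();

  LocalizationTracker(int threads) {
    pool = Executors.newFixedThreadPool(threads);
  }

  /** Queues one resource download on behalf of a container. */
  void submit(String containerId, Runnable download) {
    pending.computeIfAbsent(containerId, k -> new CopyOnWriteArrayList<>())
        .add(pool.submit(download));
  }

  /**
   * Cancels downloads still waiting in the queue for a killed container
   * and returns how many were cancelled. cancel(false) leaves tasks that
   * already started to run to completion.
   */
  int containerKilled(String containerId) {
    int cancelled = 0;
    List<Future<?>> futures = pending.remove(containerId);
    if (futures != null) {
      for (Future<?> f : futures) {
        if (f.cancel(false)) {
          cancelled++;
        }
      }
    }
    return cancelled;
  }

  void shutdown() {
    pool.shutdownNow();
  }
}
```

Passing {{false}} to {{cancel}} is deliberate: in-flight downloads are left alone, and only work that has not started is dropped, which is exactly the wasted work described above.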
[jira] [Created] (YARN-2245) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
Sangjin Lee created YARN-2245: - Summary: AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used Key: YARN-2245 URL: https://issues.apache.org/jira/browse/YARN-2245 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee With the job classloader enabled, the MR AM throws ClassNotFoundException if a custom output format class is specified. {noformat} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:473) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:374) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1459) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1456) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1389) Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895) at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:222) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469) ... 8 more Caused by: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893) ... 
10 more {noformat}
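The trace shows the lookup failing inside {{Configuration.getClass}}/{{getClassByName}}, which resolves classes through whichever classloader the {{Configuration}} holds; if that is not the job classloader, job-jar classes like the custom output format cannot be found. The stand-in below mimics that resolution path to illustrate the failure mode ({{ConfLike}} is hypothetical, not Hadoop's {{Configuration}}).

```java
/**
 * Minimal stand-in for the Configuration.getClassByName() path in the
 * stack trace above: class resolution goes through whichever classloader
 * the configuration holds, so a loader that cannot see the job jar
 * yields ClassNotFoundException for user classes. ConfLike is a
 * hypothetical stand-in, not Hadoop's Configuration.
 */
class ConfLike {

  private ClassLoader classLoader = ConfLike.class.getClassLoader();

  /** Mirrors Configuration.setClassLoader(ClassLoader). */
  void setClassLoader(ClassLoader cl) {
    classLoader = cl;
  }

  /** Mirrors Configuration.getClassByName(String). */
  Class<?> getClassByName(String name) throws ClassNotFoundException {
    return Class.forName(name, true, classLoader);
  }
}
```

This suggests the likely shape of a fix: make sure the AM's {{Configuration}} is pointed at the job classloader (Hadoop's {{Configuration}} does expose {{setClassLoader}}) before {{createOutputCommitter}} asks it to resolve the custom output format class.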
[jira] [Created] (YARN-2238) filtering on UI sticks even if I move away from the page
Sangjin Lee created YARN-2238: - Summary: filtering on UI sticks even if I move away from the page Key: YARN-2238 URL: https://issues.apache.org/jira/browse/YARN-2238 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.4.0 Reporter: Sangjin Lee Attachments: filtered.png The main data table in many web pages (RM, AM, etc.) shows an unexpected filtering behavior. If I filter the table by typing something into the key or value field (or, I suspect, any search field), the data table gets filtered. The example I used is the job configuration page for an MR job. That is expected. However, when I move away from that page and visit any other page of the same type (e.g. another job configuration page), that page is rendered with the filter still applied. That is unexpected. What's even stranger is that the filter term itself is not displayed, so I end up with a page that is mysteriously filtered but doesn't tell me what it is filtered on.
[jira] [Resolved] (YARN-1465) define and add shared constants and utilities for the shared cache
[ https://issues.apache.org/jira/browse/YARN-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-1465. --- Resolution: Invalid I'll close out these JIRAs for YARN-1492, as the design has changed since the time they were filed. define and add shared constants and utilities for the shared cache -- Key: YARN-1465 URL: https://issues.apache.org/jira/browse/YARN-1465 Project: Hadoop YARN Issue Type: New Feature Reporter: Sangjin Lee Assignee: Sangjin Lee