[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

2017-12-11 Thread pnowojski
Github user pnowojski commented on a diff in the pull request:

https://github.com/apache/flink/pull/4801#discussion_r156010369
  
--- Diff: 
flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunnerITCase.java
 ---
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.taskexecutor;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricConfig;
+import org.apache.flink.metrics.reporter.AbstractReporter;
+import org.apache.flink.runtime.minicluster.MiniCluster;
+import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;
+
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.Collection;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+
+import static org.apache.flink.configuration.MetricOptions.REPORTERS_LIST;
+import static 
org.apache.flink.configuration.TaskManagerOptions.ADDITIONAL_LOGGING;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+/**
+ * Integration tests for proper initialization in the {@link 
TaskManagerRunner}.
+ */
+public class TaskManagerRunnerITCase {
--- End diff --

Renamed to `SystemResourcesMetricsITCase`.

I have added this test after some larger rebasing conflicts, because 
otherwise I wasn't sure that everything is starting up correctly. I think that 
there would have to be a handful unit tests replacing this one `ITCase` and 
they would be more prone to fail/cause problems during refactoring.


---


[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

2017-12-11 Thread pnowojski
Github user pnowojski commented on a diff in the pull request:

https://github.com/apache/flink/pull/4801#discussion_r156008613
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/utils/SystemResourcesCounter.java
 ---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.taskexecutor.utils;
+
+import org.apache.flink.api.common.time.Time;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import oshi.SystemInfo;
+import oshi.hardware.CentralProcessor;
+import oshi.hardware.CentralProcessor.TickType;
+import oshi.hardware.HardwareAbstractionLayer;
+import oshi.hardware.NetworkIF;
+
+import javax.annotation.concurrent.ThreadSafe;
+
+import java.util.concurrent.atomic.AtomicLongArray;
+import java.util.concurrent.atomic.AtomicReferenceArray;
+
+import static org.apache.flink.util.Preconditions.checkState;
+
+/**
+ * Daemon thread logging system resources.
+ *
+ * To accurately and consistently report CPU and network usage we have to 
periodically probe
+ * CPU ticks and network sent/received bytes and then convert those values 
to CPU usage and
+ * send/receive byte rates.
+ */
+@ThreadSafe
--- End diff --

Hmmm :) This annotation is about public getters, but maybe indeed it 
doesn't make a lot of sense. On the other hand, is it harmful?


---


[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

2017-12-07 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/4801#discussion_r155503740
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java
 ---
@@ -102,37 +103,16 @@ public static void instantiateStatusMetrics(
private static void instantiateNetworkMetrics(
MetricGroup metrics,
final NetworkEnvironment network) {
-   metrics.gauge("TotalMemorySegments", new 
Gauge () {
-   @Override
-   public Long getValue() {
-   return (long) 
network.getNetworkBufferPool().getTotalNumberOfMemorySegments();
-   }
-   });
 
-   metrics.gauge("AvailableMemorySegments", new 
Gauge () {
-   @Override
-   public Long getValue() {
-   return (long) 
network.getNetworkBufferPool().getNumberOfAvailableMemorySegments();
-   }
-   });
+   final NetworkBufferPool networkBufferPool = 
network.getNetworkBufferPool();
+   metrics.gauge("TotalMemorySegments", () -> 
(long) networkBufferPool.getTotalNumberOfMemorySegments());
--- End diff --

Replace with "Integer" Gauge and change to method reference?


---


[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

2017-12-07 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/4801#discussion_r155503385
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/utils/SystemResourcesCounter.java
 ---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.taskexecutor.utils;
+
+import org.apache.flink.api.common.time.Time;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import oshi.SystemInfo;
+import oshi.hardware.CentralProcessor;
+import oshi.hardware.CentralProcessor.TickType;
+import oshi.hardware.HardwareAbstractionLayer;
+import oshi.hardware.NetworkIF;
+
+import javax.annotation.concurrent.ThreadSafe;
+
+import java.util.concurrent.atomic.AtomicLongArray;
+import java.util.concurrent.atomic.AtomicReferenceArray;
+
+import static org.apache.flink.util.Preconditions.checkState;
+
+/**
+ * Daemon thread logging system resources.
+ *
+ * To accurately and consistently report CPU and network usage we have to 
periodically probe
+ * CPU ticks and network sent/received bytes and then convert those values 
to CPU usage and
+ * send/receive byte rates.
+ */
+@ThreadSafe
--- End diff --

A Thread is ThreadSafe? ;-)


---


[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

2017-12-07 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/4801#discussion_r155501129
  
--- Diff: 
flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunnerITCase.java
 ---
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.taskexecutor;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricConfig;
+import org.apache.flink.metrics.reporter.AbstractReporter;
+import org.apache.flink.runtime.minicluster.MiniCluster;
+import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;
+
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.Collection;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+
+import static org.apache.flink.configuration.MetricOptions.REPORTERS_LIST;
+import static 
org.apache.flink.configuration.TaskManagerOptions.ADDITIONAL_LOGGING;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+/**
+ * Integration tests for proper initialization in the {@link 
TaskManagerRunner}.
+ */
+public class TaskManagerRunnerITCase {
--- End diff --

This test seems very specific to this logging, but is named as a generic 
TaskManagerRunner test. Give it a differnet name?

Separate question: Does it have to be an IT case that fully starts the TM, 
or can it be a unit test that checks config propagation?


---


[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

2017-10-11 Thread pnowojski
GitHub user pnowojski opened a pull request:

https://github.com/apache/flink/pull/4801

[FLINK-7812] Log system resources metrics

## What is the purpose of the change

This PR adds various system resources metrics, useful for analysing issues 
on machines/clusters for which there are no detailed external resources logging 
systems.

## Verifying this change

This change was manually tested, since it's difficult to write automated 
tests for this feature.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (**yes** / no)

It adds `ohci` dependency.

## Documentation

  - Does this pull request introduce a new feature? (**yes** / no)
  - If yes, how is the feature documented? (not applicable / **docs** / 
**JavaDocs** / not documented)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pnowojski/flink resources

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4801.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4801


commit d337e322477486daad19fad37747e6a8898bf619
Author: Piotr Nowojski 
Date:   2017-10-10T15:24:21Z

[hotfix][metrics] Replace anonymous classes with lamdas

commit 37eaf16bb315be35559d8ac7f5a98c38cfc1e9a4
Author: Piotr Nowojski 
Date:   2017-10-11T10:06:01Z

[hotfix][metrics] Remove redundant TaskManager metrics initialization

commit 1b2420746a613a2af73486197d4fa3aeb63f7236
Author: Piotr Nowojski 
Date:   2017-10-11T07:10:49Z

[FLINK-7812][metrics] Add system resources metrics




---