Document latest problems with docker images and resource reclamation, add docker performance checks to the monitoring scripts, and add helpers to filter the logs

author: Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
committer: Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
commit: ba144fab071258a97cf3c42a0defeb0aae41a353 (patch)
tree: 2e738d4e4774d754b56d79021cc8781b3c0835a5 /logs/2019.09.26/analysis.txt
parent: efe4b9bbe3c9cb950378de9697eed2030ac49ca2 (diff)
1 files changed, 35 insertions, 0 deletions
diff --git a/logs/2019.09.26/analysis.txt b/logs/2019.09.26/analysis.txt
new file mode 100644
index 0000000..26c123b
--- /dev/null
+++ b/logs/2019.09.26/analysis.txt
@@ -0,0 +1,35 @@
+Sep 24 13:34:18 ipekatrin2 kernel: Memory cgroup out of memory: Kill process 57372 (mongod) score 1984 or sacrifice child
+Sep 24 13:34:22 ipekatrin2 origin-node: I0924 13:34:22.704691   93115 kubelet.go:1921] SyncLoop (container unhealthy): "mongodb-2-6j5w7_services(b350130e-ac45-11e9-bbd6-0cc47adef0e6)"
+Sep 24 13:34:29 ipekatrin2 origin-node: I0924 13:34:29.774596   93115 kubelet.go:1888] SyncLoop (PLEG): "mongodb-2-6j5w7_services(b350130e-ac45-11e9-bbd6-0cc47adef0e6)", event: &pleg.PodLifecycleEvent{ID:"b350130e-ac45-11e9-bbd6-0cc47adef0e6", Type:"ContainerStarted", Data:"1d485a4dd86b8f7ff24649789eee000d55319ef64d9b447c532a43fadce2831e"}
+Sep 24 13:34:35 ipekatrin2 origin-node: I0924 13:34:35.177258   93115 roundrobin.go:310] LoadBalancerRR: Setting endpoints for services/mongodb:mongo to [10.130.0.91:27017]
+Sep 24 13:34:35 ipekatrin2 origin-node: I0924 13:34:35.177323   93115 roundrobin.go:240] Delete endpoint 10.130.0.91:27017 for service "services/mongodb:mongo"
+... Nothing about mongod on any node until the mass destruction ....
+====
+Sep 25 07:52:00 ipekatrin2 origin-node: I0925 07:52:00.422291   93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s]
+Sep 25 07:52:31 ipekatrin2 origin-master-controllers: I0925 07:52:31.761961  109653 nodecontroller.go:617] Node is NotReady. Adding Pods on Node ipekatrin2.ipe.kit.edu to eviction queue
+Sep 25 07:52:47 ipekatrin2 origin-master-controllers: I0925 07:52:47.584394  109653 controller_utils.go:89] Starting deletion of pod services/mongodb-2-6j5w7
+Sep 25 07:56:04 ipekatrin2 origin-node: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
+Sep 25 08:07:41 ipekatrin2 systemd-logind: Failed to start session scope session-118144.scope: Connection timed out
+====
+Sep 26 08:53:19 ipekatrin2 origin-master-controllers: I0926 08:53:19.435468  109653 nodecontroller.go:644] Node is unresponsive. Adding Pods on Node ipekatrin3.ipe.kit.edu to eviction queues: 
+Sep 26 08:54:09 ipekatrin3 kernel: glustertimer invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-999
+Sep 26 08:54:27 ipekatrin3 kernel: Out of memory: Kill process 91288 (mysqld) score 1075 or sacrifice child
+Sep 26 08:54:14 ipekatrin2 etcd: lost the TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 reader)
+Sep 26 08:55:02 ipekatrin2 etcd: established a TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 writer)
+Sep 26 08:57:54 ipekatrin3 origin-node: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
+
+Sep 26 09:34:20 ipekatrin2 origin-node: I0926 09:34:20.361306   93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 8m12.284528292s ago; threshold is 3m0s]
+
+
+0. ipekatrin1 (and to a lesser degree ipekatrin2) was affected by a huge number of images, which slowed down communication with Docker.
+   Scheduling on ipekatrin1 was disabled for development purposes.
+1. On 24.09 mongodb used more memory than allowed by the 'dc' configuration and was killed by the OpenShift/cgroup OOM killer.
+2. For some reason, the service was not restarted, leaving rocketchat non-operational.
+3. On 25.09 7:52 ipekatrin2 became unhealthy and unschedulable, due to PLEG timeouts?
+   * Pods migrated to ipekatrin3. The mass migration caused performance problems, leading to systemd (and mount) problems
+   * The system recovered relatively quickly, but few pods were running on ipekatrin2, and ipekatrin3 was severely overloaded
+4. On 26.09 8:53 the system OOM killer was triggered on ipekatrin3 due to an overall lack of memory
+   * The node was marked unhealthy and pod eviction was triggered
+   * etcd problems were registered, causing real problems in the cluster fabric
+5. On 26.09 9:34 PLEG recovered for some reason.
+   * Most of the pods were rescheduled automatically and the system eventually recovered.
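The timeline above was assembled from OOM-killer and PLEG messages in the syslog excerpts. A minimal sketch of a log filter that extracts those two kinds of events is shown below. It is not the helper added by this commit: the names `OOM_RE`, `PLEG_RE`, and `filter_events` are hypothetical, and the patterns assume exactly the line layout quoted in analysis.txt (PLEG durations in the `XmY.Zs` form only).

```python
import re

# Hypothetical patterns modelled on the syslog lines quoted in analysis.txt.
OOM_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ [\d:]{8}) (?P<node>\S+) kernel: "
    r"(?P<scope>Memory cgroup out of memory|Out of memory): "
    r"Kill process (?P<pid>\d+) \((?P<proc>[^)]+)\)"
)
PLEG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ [\d:]{8}) (?P<node>\S+) origin-node: .*"
    r"PLEG is not healthy: pleg was last seen active "
    r"(?:(?P<min>\d+)m)?(?P<sec>[\d.]+)s ago"
)


def filter_events(lines):
    """Return (timestamp, node, kind, detail) tuples for OOM kills and PLEG stalls."""
    events = []
    for line in lines:
        line = line.lstrip("+")  # tolerate the leading '+' of diff output
        m = OOM_RE.match(line)
        if m:
            # Distinguish a per-cgroup limit hit from a node-wide OOM.
            kind = "cgroup-oom" if m["scope"].startswith("Memory cgroup") else "system-oom"
            events.append((m["stamp"], m["node"], kind, m["proc"]))
            continue
        m = PLEG_RE.match(line)
        if m:
            # Convert the "XmY.Zs" duration to seconds (hours are not handled).
            stalled = 60 * int(m["min"] or 0) + float(m["sec"])
            events.append((m["stamp"], m["node"], "pleg-stall", round(stalled, 1)))
    return events


if __name__ == "__main__":
    import sys
    for event in filter_events(sys.stdin):
        print(*event, sep="\t")
```

Fed the log excerpt from the file on standard input, the script prints one tab-separated line per matched event, which makes it easier to see which node ran out of memory and how long PLEG had been stalled at each report.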