Document latest problems with docker images and resource reclamation, add docker performance checks to the monitoring scripts, add helpers to filter the logs

author: Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
committer: Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
commit: ba144fab071258a97cf3c42a0defeb0aae41a353 (patch)
tree: 2e738d4e4774d754b56d79021cc8781b3c0835a5 /docs/consistency.txt
parent: efe4b9bbe3c9cb950378de9697eed2030ac49ca2 (diff)
download: ands-ba144fab071258a97cf3c42a0defeb0aae41a353.tar.gz
ands-ba144fab071258a97cf3c42a0defeb0aae41a353.tar.bz2
ands-ba144fab071258a97cf3c42a0defeb0aae41a353.tar.xz
ands-ba144fab071258a97cf3c42a0defeb0aae41a353.zip
1 files changed, 10 insertions, 4 deletions
diff --git a/docs/consistency.txt b/docs/consistency.txt
index 91a0ee7..3769a60 100644
--- a/docs/consistency.txt
+++ b/docs/consistency.txt
@@ -9,6 +9,10 @@ General overview
     oc get pvc --all-namespaces -o wide
  - API health check
     curl -k https://apiserver.kube-service-catalog.svc/healthz
+ - Docker status (at each node)
+    docker info
+    * Enough Data and Metadata Space is available 
+    * The number of resident images is in check (>500-1000 - bad, >2000-3000 - terrible)
 
 Nodes
 =====
@@ -31,7 +35,7 @@ Storage
 Networking
 ==========
  - Check that correct upstream name servers are listed for both DNSMasq (host) and SkyDNS (pods).
- If not fix and restart 'origin-node' and 'dnsmasq'.
+ If not, fix and restart 'origin-node' and 'dnsmasq' (sometimes DNSMasq is simply stuck).
     * '/etc/dnsmasq.d/origin-upstream-dns.conf'
     * '/etc/origin/node/resolv.conf'
 
@@ -46,12 +50,14 @@ Networking
  - Ensure, we don't have override of cluster_name to first master (which we do during the
  provisioning of OpenShift plays)
 
- - Sometimes OpenShift fails to clean-up after terminated pod properly. This causes rogue
- network interfaces to remain in OpenVSwitch fabric. This can be determined by errors like:
+ - Sometimes OpenShift fails to clean up properly after a terminated pod (this problem is particularly
+ likely on systems with a huge number of resident docker images). This leaves rogue network
+ interfaces in the OpenVSwitch fabric. This can be determined by errors like:
     could not open network device vethb9de241f (No such device)
  reported by 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log' 
  which may quickly grow over 100MB quickly. If number of rogue interfaces grows too much,
- the pod scheduling will start time-out on the affected node. 
+ the pod scheduling gets even worse (compared to delays caused only by docker images) and
+ will start to time out on the affected node.
   * The work-around is to delete rogue interfaces with 
     ovs-vsctl del-port br0 <iface>
  This does not solve the problem, however. The new interfaces will get abandoned by OpenShift.
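
The work-around above can be scripted. What follows is a minimal sketch and not part of the commit: 'list_rogue_ifaces' is a hypothetical helper name, and it assumes the errors appear in the 'ovs-vsctl show' output in exactly the form quoted above and that the bridge is 'br0'.

```shell
# Hypothetical helper: read 'ovs-vsctl show' output (or the vswitchd log) on
# stdin and print the name of every interface reported as missing, once each.
# Assumes the error text matches the example quoted in the notes.
list_rogue_ifaces() {
    sed -n 's/.*could not open network device \([^ ]*\) (No such device).*/\1/p' | sort -u
}

# Usage on an affected node (requires root and a running Open vSwitch):
#   ovs-vsctl show | list_rogue_ifaces | while read -r iface; do
#       ovs-vsctl del-port br0 "$iface"
#   done
```

As noted above, this only clears the interfaces that are already abandoned; it does not stop new ones from accumulating, so it would have to be re-run periodically.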