From 18da6e4b5942f4fcaa9db3ba3bf1dfcd1857e9ea Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Thu, 10 Jan 2019 06:43:26 +0100
Subject: Update troubleshooting documentation

---
 docs/consistency.txt | 14 +++--
 docs/problems.txt | 59 +++++++++++++++++++---
 docs/webservices.txt | 5 +-
 .../templates/scripts/check_server_status.sh.j2 | 10 ++++
 4 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/docs/consistency.txt b/docs/consistency.txt
index dcf311a..082a734 100644
--- a/docs/consistency.txt
+++ b/docs/consistency.txt
@@ -5,7 +5,7 @@ General overview
     oc get cs - only etcd (other services will fail on Openshift)
  - All nodes and pods are fine and running and all pvc are bound
     oc get nodes
-    oc get pods --all-namespaces -o wide
+    oc get pods --all-namespaces -o wide - Check also that no pods are stuck in Terminating/Pending status for a long time
     oc get pvc --all-namespaces -o wide
  - API health check
     curl -k https://apiserver.kube-service-catalog.svc/healthz
@@ -50,10 +50,14 @@ Networking
        ovs-vsctl del-port br0 
    This does not solve the problem, however. The new interfaces will get abandoned by OpenShift.
 ADEI
 ====
  - MySQL replication is working
- - No caching pods are hung (for whatever reason)
- 
\ No newline at end of file
+ - No caching pods or maintenance pods are hung (for whatever reason)
+   * Check that no ADEI pods are stuck in Deleting/Pending status
+   * Check the logs of the 'cacher' and 'maintenance' scripts and ensure none is stuck on an ages-old time-stamp (unless we are re-caching something huge)
+   * Ensure there are no old pending scripts in '/adei/tmp/adminscripts'
+   Possible reasons:
+   * Stale 'flock' locks (can be found by analyzing backtraces in the corresponding /proc/<pid>/stack)
+   * Hung connections to MySQL (can be found by executing 'SHOW PROCESSLIST' on the MySQL servers)
\ No newline at end of file
diff --git a/docs/problems.txt b/docs/problems.txt
index 4be9dc7..fa88afe 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -17,6 +17,9 @@ Rogue network interfaces on OpenVSwitch bridge
  * As the number of rogue interfaces grows, it starts to impact performance. Operations with ovs
    slow down and at some point pods scheduled to the affected node fail to start due to timeouts.
    This is indicated in 'oc describe' as: 'failed to create pod sandbox'
+ * With time, the new rogue interfaces are created faster and faster. At some point, it really
+   slows down the system and causes pod failures (if many pods are re-scheduled in parallel) even
+   if not that many rogue interfaces are present

 Cause:
  * Unclear, but it seems the periodic ADEI cron jobs cause the issue.
@@ -25,7 +28,7 @@ Rogue network interfaces on OpenVSwitch bridge

 Solutions:
- * According to RedHat the temporal solution is to reboot affected node (not tested yet). The problem
+ * According to RedHat the temporary solution is to reboot the affected node (not helping in my case). The problem
    should go away, but may re-appear after a while.
  * The simplest work-around is to just remove the rogue interfaces. They will be re-created, but
    performance problems only start after hundreds accumulate.
@@ -35,6 +38,54 @@ Rogue network interfaces on OpenVSwitch bridge
  * A cron job is installed which cleans rogue interfaces when their number hits 25.

+Hung pods
+=========
+ POD processes may get stuck. Normally, such processes will be detected using the 'liveness' probe and will be
+ restarted by OpenShift if necessary. However, occasionally processes may get stuck in syscalls (such processes
+ are marked with 'D' in ps). These processes can't be killed with SIGKILL and OpenShift will not be able
+ to terminate them, leaving them indefinitely in 'Terminating' status.
+
+ Problems:
+  * Pods stuck in 'Terminating' status prevent the start of new replicas. In case of 'jobs', a large number
+    of 'Terminating' pods could overload OpenShift controllers.
+
+ Cause:
+  * One reason is spurious locks on the GlusterFS file system. On CentOS 7, it is impossible to interrupt a
+    process waiting for a lock initiated by a blocking 'flock' call. It gets stuck in a syscall and is indicated
+    by state 'D' in the ps output. Sometimes, GlusterFS may keep files locked even though the processes holding these
+    locks have already exited/crashed. I am not sure about the exact conditions when this happens, but it seems, for
+    instance, that a crashed Docker daemon may cause this effect if some of the running containers were holding locks on
+    GFS at the moment of the crash.
+    - We can verify if this is the case by checking if the process associated with the problematic pod is stuck in
+      state 'D' and by analyzing its backtrace (/proc/<pid>/stack).
+
+
+ Solutions:
+  * Avoid blocking flock on GlusterFS. Use polling with sleep instead. To release already stuck pods, we need
+    to find and destroy the problematic locks. GlusterFS allows debugging locks using 'statedump'; check the GlusterFS
+    documentation for details. While there is also a mechanism to clear such locks, it does not always work. An
+    alternative is to remove the locked files AND keep them removed for a while until all blocked 'flock' syscalls
+    are released.
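The flock polling work-around suggested above can be sketched as follows. This is an illustrative sketch only, not part of the patch: the lock path, retry limit, and function name are made up, not taken from the actual ADEI scripts.

```shell
#!/bin/bash
# Sketch: poll with a non-blocking flock instead of blocking in the syscall.
# LOCKFILE and MAX_TRIES are hypothetical values for illustration.
LOCKFILE="/tmp/adei-cacher.lock"
MAX_TRIES=30

acquire_lock() {
    local tries=0
    exec 9>"$LOCKFILE"
    # 'flock -n' returns immediately instead of sleeping uninterruptibly,
    # so the process can always be killed and never ends up in 'D' state.
    while ! flock -n 9; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$MAX_TRIES" ]; then
            echo "could not acquire $LOCKFILE after $MAX_TRIES attempts" >&2
            return 1
        fi
        sleep 1
    done
}

if acquire_lock; then
    echo "lock acquired"
    # ... critical section ...
    flock -u 9
fi
```

Because the wait happens in interruptible 'sleep' rather than inside the flock syscall, a pod using this pattern can still be terminated by OpenShift even if the GlusterFS lock is never released.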
+
+
+Hung MySQL connections
+======================
+ Stale MySQL locks may prevent new clients from connecting to certain tables in the MySQL database.
+
+ Problems:
+  * The problem may affect either only clients trying to obtain 'write' access or all usage patterns. In the first case,
+    it will cause ADEI 'caching' threads to hang indefinitely and 'maintain' threads will be terminated after the specified
+    timeout, leaving administrative scripts unprocessed.
+
+ Cause:
+  * For whatever reason, some crashed clients may preserve their locks. I believe a crashed 'docker'
+    daemon could be one possible reason. The problem can be found by executing 'SHOW PROCESSLIST'
+    on the MySQL server. More diagnostic possibilities are discussed in the MySQL notes.
+
+ Solutions:
+  * Normally, restarting the MySQL pod should be enough.
+

 Orphaning / pod termination problems in the logs
 ================================================
 There are several classes of problems reported with unknown repercussions in the system log. Currently, I
@@ -96,8 +147,4 @@ Orphaning / pod termination problems in the logs
  Scenario:
   * Reported on long-running pods with persistent volumes (katrin, adai-db)
   * Also seems to be an unrelated set of problems.
-
-
-
-
-
+ 
\ No newline at end of file
diff --git a/docs/webservices.txt b/docs/webservices.txt
index f535d46..2545bd5 100644
--- a/docs/webservices.txt
+++ b/docs/webservices.txt
@@ -43,8 +43,9 @@ Updating/Generating certificates for the router
  - Installing
    * Two files are needed.
      1) Secret Key
-     2) PEM file containing both certificate and secret key. No CA certificate is needed (at least if our
-     certifcate is signed by known CA)
+     2) PEM file containing both certificate and secret key, and all certificates in the chain
+       up to the root certificate (the trusted root may be omitted, but including it causes no
+       problems either)
    * New 'router-certs' secret should be created in 'default' namespace.
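The 'SHOW PROCESSLIST' check mentioned above can be scripted to flag long-stuck connections. A sketch, with assumptions: the 600-second threshold, the helper name `filter_stuck`, and the sample input are all illustrative; in practice the input would come from `mysql -e 'SHOW PROCESSLIST'` against the live server.

```shell
#!/bin/bash
# Sketch: flag MySQL connections stuck for more than 10 minutes.
# PROCESSLIST columns (tab-separated, as the mysql client prints them):
#   Id  User  Host  db  Command  Time  State  Info
filter_stuck() {
    awk -F'\t' 'NR > 1 && $6 + 0 > 600 {
        printf "connection %s stuck for %ss in state \"%s\"\n", $1, $6, $7
    }'
}

# A captured sample stands in for the live query so the filter can be shown:
printf 'Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo\n42\tadei\tnode1:3344\tadei\tQuery\t7205\tWaiting for table metadata lock\tINSERT ...\n' | filter_stuck
# prints: connection 42 stuck for 7205s in state "Waiting for table metadata lock"
```

Connections reported by such a filter in a lock-waiting state are the candidates for the restart-the-pod solution above.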
   Probably it is better to modify the existing secret than to delete/create it. However, the strings
   can't just be copied. The easiest way is to create a new secret in a temporary namespace:
diff --git a/roles/ands_monitor/templates/scripts/check_server_status.sh.j2 b/roles/ands_monitor/templates/scripts/check_server_status.sh.j2
index b02f031..0bef13c 100755
--- a/roles/ands_monitor/templates/scripts/check_server_status.sh.j2
+++ b/roles/ands_monitor/templates/scripts/check_server_status.sh.j2
@@ -43,3 +43,13 @@ vssize=$(du -sm /var/log/openvswitch/ovs-vswitchd.log | cut -f 1)
 if [ "$vssize" -gt 128 ]; then
     echo "Current OpenVSwitch log is over $vssize MB. It could indicate some severe problems in pod networking..."
 fi
+
+host google.com &> /dev/null
+if [ $? -ne 0 ]; then
+    echo "DNS problems, can't resolve google.com"
+fi
+
+ping -c 1 -W 2 8.8.8.8 &> /dev/null
+if [ $? -ne 0 ]; then
+    echo "Networking problems, can't ping Google's public DNS server"
+fi
-- 
cgit v1.2.3
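The monitoring script extended by this patch could also watch for processes stuck in uninterruptible sleep, which is how the GlusterFS lock problems described in docs/problems.txt show up. A sketch of such a check, not part of the patch; the output wording is illustrative:

```shell
#!/bin/bash
# Sketch: report processes in 'D' state (uninterruptible sleep) and dump
# their kernel stacks to show where each one is blocked.
# 'ps -eo col=' suppresses headers; STAT starting with 'D' marks the state.
dstate=$(ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ { print $1, $3 }')
if [ -n "$dstate" ]; then
    echo "Processes stuck in 'D' state (possible GlusterFS lock problem):"
    echo "$dstate"
    # Kernel stacks are readable as root and reveal the blocking syscall:
    for pid in $(echo "$dstate" | awk '{ print $1 }'); do
        echo "--- /proc/$pid/stack ---"
        cat "/proc/$pid/stack" 2>/dev/null || echo "(not readable, run as root)"
    done
fi
```

On a healthy node this prints nothing; a persistent 'D'-state process whose stack shows it waiting in a lock call is a candidate for the flock clean-up procedures described in the problems document.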