Various fixes before moving to hardware installation

author: Suren A. Chilingaryan <csa@suren.me> 2018-03-11 19:56:38 +0100
committer: Suren A. Chilingaryan <csa@suren.me> 2018-03-11 19:56:38 +0100
commit: f3c41dd13a0a86382b80d564e9de0d6b06fb1dbf (patch)
tree: 3522ce77203da92bb2b6f7cfa2b0999bf6cc132c /docs
parent: 6bc3a3ac71e11fb6459df715536fec373c123a97 (diff)
12 files changed, 673 insertions, 1 deletions
diff --git a/docs/ands_ansible.txt b/docs/ands_ansible.txt
index 80a7cf0..70800e1 100644
--- a/docs/ands_ansible.txt
+++ b/docs/ands_ansible.txt
@@ -89,7 +89,7 @@ Ansible parameters (global)
     glusterfs_version           group_vars
     glusterfs_transport         group_vars
 
- - OPenShift specific
+ - OpenShift specific
     ands_openshift_labels       setup/configs   Labels to assign to the nodes
     ands_openshift_projects     setup/configs   List of projects to configure (with GlusterFS endpoints, etc.)
     ands_openshift_users        setup/configs   Optional list of user names with contacts
diff --git a/docs/backup.txt b/docs/backup.txt
new file mode 100644
index 0000000..1b25592
--- /dev/null
+++ b/docs/backup.txt
@@ -0,0 +1,26 @@
+Critical directories and services
+---------------------------------
+ - etcd database [ once ]
+    * There are etcd2 and etcd3 APIs. OpenShift 3.5+ uses etcd3, but the documentation
+    still describes the etcd2-style backup. etcd3 is backward compatible with etcd2,
+    so we can run the etcd2 backup as well. The open question is whether we need to back up
+    both ways (OpenShift 3.5 definitely has etcd3 data) or just etcd3,
+    treating the etcd2 instructions as a bug in the documentation.
+    * etcd3
+        etcdctl3 --endpoints="192.168.213.1:2379" snapshot save snapshot.db
+    * etcd2
+        etcdctl backup --data-dir /var/lib/etcd/ --backup-dir .
+        cp "$ETCD_DATA_DIR"/member/snap/db member/snap/db
+
+ - heketi topology [ once ]
+    heketi-cli -s  http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info --json 
+
+ - Gluster volume information [ storage nodes ]
+    * /var/lib/glusterd/glusterd.info
+    * /var/lib/glusterd/peers
+    * /var/lib/glusterd/glustershd                      - not mentioned in docs
+
+ - etc [ all nodes ]
+    * /etc/origin/                                      - Only *.key *.crt from /etc/origin/master in docs
+    * /etc/etcd                                         - Not mentioned
+    * /etc/docker                                       - Only certs.d
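A minimal sketch of how the path lists from docs/backup.txt could be collected into one archive per host. The function names 'backup_paths' and 'make_backup', the role keywords 'storage' and 'all', and the archive name are hypothetical and not part of the repository; only the paths come from the text above. The etcd and heketi dumps are not covered here.

```shell
#!/bin/sh
# Hypothetical helper: print the paths listed in docs/backup.txt for a node role.
backup_paths() {
    case "$1" in
        storage)
            echo /var/lib/glusterd/glusterd.info
            echo /var/lib/glusterd/peers
            echo /var/lib/glusterd/glustershd
            ;;
        all)
            echo /etc/origin
            echo /etc/etcd
            echo /etc/docker
            ;;
        *)
            echo "unknown role: $1" >&2
            return 1
            ;;
    esac
}

# Hypothetical helper: archive whichever of those paths exist on this host.
# Paths that are missing on the host are silently skipped.
make_backup() {
    role="$1"; archive="$2"
    set --
    for path in $(backup_paths "$role"); do
        [ -e "$path" ] && set -- "$@" "$path"
    done
    [ "$#" -gt 0 ] || { echo "nothing to back up" >&2; return 1; }
    tar -czf "$archive" "$@"
}
```

On a storage node this could be called as 'make_backup storage gluster-backup.tar.gz', on every node as 'make_backup all etc-backup.tar.gz'.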
diff --git a/docs/consistency.txt b/docs/consistency.txt
new file mode 100644
index 0000000..127d9a7
--- /dev/null
+++ b/docs/consistency.txt
@@ -0,0 +1,36 @@
+General overview
+=================
+ - etcd services (worth checking both ports)
+    etcdctl3 --endpoints="192.168.213.1:2379" member list       - doesn't check health, only reports members
+    oc get cs                                                   - only etcd (other services will fail on OpenShift)
+ - All nodes and pods are up and running, and all PVCs are bound
+    oc get nodes
+    oc get pods --all-namespaces -o wide
+    oc get pvc --all-namespaces -o wide
+ - API health check
+    curl -k https://apiserver.kube-service-catalog.svc/healthz
+
+Storage
+=======
+ - Heketi status 
+    heketi-cli -s  http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
+ - Status of Gluster volumes (and of their bricks, which often fail with heketi)
+    gluster volume info
+    ./gluster.sh info all_heketi
+ - Check available storage space on the system partition and LVM volumes (docker, heketi, ands)
+    Run 'df -h' and 'lvdisplay' on each node
+
+Networking
+==========
+ - Check that both internal and external addresses are resolvable from all hosts.
+    * I.e. we should be able to resolve 'google.com'
+    * And we should be able to resolve 'heketi-storage.glusterfs.svc.cluster.local'
+    
+ - Check that the keepalived service is up and the corresponding IPs are really assigned to one
+ of the nodes (the vagrant provisioner would remove keepalived-tracked IPs, but keepalived would
+ continue running without noticing it)
+ 
+ - Ensure that cluster_name is no longer overridden to point to the first master (which we do
+ during provisioning by the OpenShift plays)
+ 
+ 
+\ No newline at end of file
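The pod and pvc checks from the 'General overview' section can be turned into filters, sketched below. The function names are hypothetical. The column positions are an assumption based on the default output of 'oc get pods --all-namespaces' (NAMESPACE NAME READY STATUS ...) and 'oc get pvc --all-namespaces' (NAMESPACE NAME STATUS ...); pods that are Running but not ready are not detected by this sketch.

```shell
#!/bin/sh
# Reads 'oc get pods --all-namespaces' output on stdin and prints
# "<namespace>/<pod> <status>" for every pod that is neither Running nor Completed.
unhealthy_pods() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Reads 'oc get pvc --all-namespaces' output on stdin and prints
# "<namespace>/<pvc> <status>" for every claim that is not Bound.
unbound_pvcs() {
    awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Usage would be 'oc get pods --all-namespaces | unhealthy_pods' and 'oc get pvc --all-namespaces | unbound_pvcs'; empty output means nothing suspicious was found.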
diff --git a/docs/managment.txt b/docs/managment.txt
new file mode 100644
index 0000000..1eca8a8
--- /dev/null
+++ b/docs/managment.txt
@@ -0,0 +1,166 @@
+DOs and DONTs
+=============
+ Here we discuss things we should and should not do!
+ 
+ - Scaling up the cluster is normally problem-free. Both nodes & masters can be added
+ quickly and without much trouble afterwards. 
+
+ - The upgrade procedure may cause problems. The main trouble is that many pods are 
+ configured to use the 'latest' tag, and the latest versions have the latest problems (some
+ of the tags can be pinned to an actual version, but finding out what is broken and why takes
+ a lot of effort)...
+    * Currently, there are problems if 'kube-service-catalog' is updated (see discussion
+    in docs/upgrade.txt). While it seems nothing really changes, the connection between
+    apiserver and etcd breaks down (at least for health checks). The installation remains
+    pretty much usable, but not in a healthy state. This particular update is blocked by
+    setting
+        openshift_enable_service_catalog: false
+    Then, it is left in 'Error' state, but can be easily recovered by deleting it and 
+    allowing the system to re-create a new pod. 
+    * However, as the cause is unclear, it is possible that something else will break as time
+    passes and new images are released. It is ADVISED to check the upgrade in staging first.
+    * During the upgrade, other system pods may also get stuck in 'Error' state (as explained
+    in troubleshooting) and block the flow of the upgrade. Just delete them and allow the
+    system to re-create them to continue.
+    * After the upgrade, it is necessary to verify that all pods are operational and 
+    restart the ones in 'Error' state.
+
+ - Re-running the install will break on heketi. And it will DESTROY the heketi topology!
+ DON'T DO IT! Instead, separate components can be re-installed.
+    * For instance, to reinstall 'openshift-ansible-service-broker' use
+         openshift-install-service-catalog.yml
+    * There is a way to prevent plays from touching heketi; we need to define
+        openshift_storage_glusterfs_is_missing: False
+        openshift_storage_glusterfs_heketi_is_missing: False
+    But I am not sure whether this is the only major issue.
+
+ - A few administrative tools can cause trouble. Don't run
+    * oc adm diagnostics
+
+
+Failures / Immediate
+========
+ - We need to remove the failed node from etcd cluster
+    etcdctl3 --endpoints="192.168.213.1:2379" member list
+    etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
+
+ - Further, the following is required on all remaining nodes if the node is gone forever
+    * Delete the node 
+        oc delete node <node_name>
+    * Also remove it from ETCD_INITIAL_CLUSTER in /etc/etcd.conf on all nodes
+    * Remove failed nodes from the 'etcdClientInfo' section in /etc/origin/master/master-config.yaml
+        systemctl restart origin-master-api.service 
+    
+Scaling / Recovery
+=======
+ - One important point.
+  * If we lost data on the storage node, it should be re-added with a different name (otherwise
+  the GlusterFS recovery would be significantly more complicated)
+  * If Gluster bricks are preserved, we may keep the name. I have not tried, but according to the
+  documentation, it should be possible to reconnect it and synchronize. Still, it may be 
+  easier to use a new name again to simplify the procedure.
+  * Simple OpenShift nodes may be re-added with the same name, no problem.
+
+ - Next we need to perform all preparation steps (the --limit should not be applied as we normally
+ need to update CentOS on all nodes to synchronize software versions; list all nodes in /etc/hosts 
+ files; etc).
+    ./setup.sh -i staging prepare
+
+ - The OpenShift scale is provided as several ansible plays (scale-masters, scale-nodes, scale-etcd).
+  * Running 'masters' will also install configured 'nodes' and 'etcd' daemons
+  * I guess running 'nodes' will also handle 'etcd' daemons, but I have not checked.
+
+Problems
+--------
+ - There should be no problems if a simple node crashes, but things may go wrong if one of the 
+ masters crashes. And things will definitely go wrong if the complete cluster is cut off from power.
+  * Some pods will be stuck pulling images. This happens if the node running docker-registry has crashed
+  and persistent storage was not used to back the registry. It can be fixed by re-scheduling the build 
+  and rolling out the latest version from the dc.
+        oc -n adei start-build adei
+        oc -n adei rollout latest mysql
+    OpenShift will trigger the rollout automatically after some time, but it will take a while. The builds 
+    have to be started manually, it seems.
+  * In case of a long outage some CronJobs will stop executing. The reason is some protection against
+  excessive load combined with missing defaults. The fix is easy, just set how much time the OpenShift scheduler
+  allows a CronJob to start before considering it failed:
+    oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 10 }}'
+
+ - If we forgot to remove the old host from the etcd cluster, the OpenShift node will be configured, but etcd
+ will not be installed. We then need to remove the node as explained above and run the scale play for the etcd
+ cluster.
+    * On multiple occasions, the etcd daemon failed after reboot and needed to be restarted manually.
+    If half of the daemons are broken, 'oc' will block.    
+
+    
+
+Storage / Recovery
+=======
+ - Furthermore, it is necessary to add glusterfs nodes on the new storage nodes. This is not performed 
+ automatically by the scale plays. The 'glusterfs' play should be executed with additional options
+ specifying that we are just re-configuring nodes. We can check if all pods are serviced
+    oc -n glusterfs get pods -o wide
+ Both OpenShift and etcd clusters should be in a proper state before running this play. Fixing and re-running
+ should not be an issue.
+ 
+ - More details:
+    https://docs.openshift.com/container-platform/3.7/day_two_guide/host_level_tasks.html
+
+
+Heketi
+------
+ - With heketi things are straightforward: we need to mark the node as broken. Then heketi will automatically move the
+ bricks to other servers (as it sees fit).
+    * Accessing heketi
+        heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)"  
+    * Getting the required ids
+        heketi-cli topology info
+    * Removing node
+        heketi-cli node info <failed_node_id>
+        heketi-cli node disable <failed_node_id>
+        heketi-cli node remove <failed_node_id>
+    * That's it. A few self-healing daemons are running which should bring the volumes in order automatically.
+    * The node will still persist in the heketi topology as failed, but will not be used ('node delete' could potentially destroy it, but it is failing)
+
+ - One problem with heketi: it may start volumes before the bricks are ready. Consequently, it may run volumes with several bricks offline. This should be
+ checked and fixed by restarting the volumes.
+ 
+KaaS Volumes
+------------
+ There are two modes. 
+ - If we migrated to a new server, we need to migrate the bricks (force is required because
+ the source brick is dead and data can't be copied)
+        gluster volume replace-brick <volume> <src_brick> <dst_brick>  commit force
+    * There are healing daemons running and nothing else has to be done.
+    * There are a play and scripts available to move all bricks automatically
+
+ - If we kept the name and the data is still there, it should be also relatively easy
+ to perform migration (not checked). We also should have backups of all this data.
+    * Ensure Gluster is not running on the failed node
+        oadm manage-node ipeshift2 --schedulable=false
+        oadm manage-node ipeshift2 --evacuate
+    * Verify the gluster pod is not active. It may be running, but not ready.
+    Could be double checked with 'ps'.
+        oadm manage-node ipeshift2 --list-pods
+    * Get the original Peer UUID of the failed node (by running on healthy node)
+        gluster peer status
+    * And create '/var/lib/glusterd/glusterd.info' similar to the one on the 
+    healthy nodes, but with the found UUID.
+    * Copy peers from the healthy nodes to /var/lib/glusterd/peers. We need to
+    copy from 2 nodes as a node does not hold peer information about itself.
+    * Create mount points and re-schedule gluster pod. See more details
+        https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts
+    * Start healing
+        gluster volume heal VOLNAME full
+
+ - However, if data is lost, it is quite complicated to recover using the same server name. 
+ We should rename the server and use the first approach instead.
+ 
+ 
+ 
+Scaling
+=======
+We currently have several assumptions which will probably not hold true for larger clusters
+ - Gluster
+    To simplify matters we just reference servers in the storage group manually
+    Arbiter may work for several groups and we should define several brick path in this case
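For the brick migration described under 'KaaS Volumes' in docs/managment.txt, the per-volume replace-brick commands can be generated from the output of 'gluster volume info'. This is only a sketch and not the play or scripts mentioned there: the function name is hypothetical, it assumes the usual 'Volume Name:' and 'BrickN: host:/path' lines, it assumes the new server uses the same brick paths, and it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Reads 'gluster volume info' output on stdin and prints (does not execute) one
# replace-brick command for every brick hosted on the dead server.
# Usage: gluster volume info | replace_brick_commands <old_host> <new_host>
replace_brick_commands() {
    awk -v src="$1" -v dst="$2" '
        /^Volume Name:/ { volume = $3 }
        /^Brick[0-9]+:/ {
            split($2, brick, ":")          # brick[1] = host, brick[2] = path
            if (brick[1] == src)
                print "gluster volume replace-brick", volume, $2, dst ":" brick[2], "commit force"
        }'
}
```

The printed commands can be reviewed first and then piped to 'sh' on one of the healthy storage nodes.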
diff --git a/docs/network.txt b/docs/network.txt
new file mode 100644
index 0000000..a164d36
--- /dev/null
+++ b/docs/network.txt
@@ -0,0 +1,58 @@
+Configuration
+=============
+openshift_ip                                    Infiniband IPs for fast communication (it is also used for the ADEI/MySQL bridge 
+                                                and so should reside on a fast network).
+openshift_hostname                              The 'cluster' host name. Should match the real host name for certificate validation.
+                                                So, it should be set if the default ip does not resolve to the host name
+openshift_public_ip                             We may either skip this or set to our 192.168.26.xxx network. Usage is unclear
+openshift_public_hostname                       I guess it is also for certificates, but while communicating with external systems
+openshift_master_cluster_hostname               Internal cluster load-balancer or just pointer to master host
+openshift_public_master_cluster_hostname        The main cluster gateway
+
+
+Complex Network
+===============
+Some things in OpenShift ansible scripts are still implemented with assumption we have 
+a simple network configuration with a single interface communicating to the world. There
+are several options to change this:
+  openshift_set_node_ip - This variable configures nodeIP in the node configuration. This 
+  variable is needed in cases where it is desired for node traffic to go over an interface 
+  other than the default network interface. 
+  openshift_ip - This variable overrides the cluster internal IP address for the system. 
+  Use this when using an interface that is not configured with the default route.
+  openshift_hostname - This variable overrides the internal cluster host name for the system. 
+  Use this when the system’s default IP address does not resolve to the system host name.
+Furthermore, if we use infiniband which is not accessible to outside world we need to set
+  openshift_public_ip -  Use this for cloud installations, or for hosts on networks using 
+  a network address translation
+  openshift_public_hostname - Use this for cloud installations, or for hosts on networks 
+  using a network address translation (NAT).
+
+ This, however, is not used throughout all system components. Some provisioning code and
+installed scripts still detect a kind of 'main system ip' to look for the
+services. This ip is identified either as 'ansible_default_ip' or by code trying
+to find the ip which is used to send packets over the default route. Ansible in the end does
+the same thing. This plays out badly for several reasons. 
+ - We have keepalived ips moving between systems. The scripts actually catch
+ these moving ips instead of the fixed ip bound to the system. 
+ - There could be several default routes. While this is not a problem in itself, the scripts do not expect
+ that and may fail.
+ 
+For instance, consider the script '99-origin-dns.sh' in /etc/NetworkManager/dispatcher.d. 
+    * def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
+ 1) It does not expect multiple default routes and will just pick a random one. Then, the
+    * if [[ ${DEVICE_IFACE} == ${def_route_int} ]]; then   
+  check may fail, and resolv.conf will not be updated because the interface just brought up 
+  is considered not to be on the default route, although it actually is. Furthermore,
+    * def_route_ip=$(/sbin/ip route get to ${def_route} | awk '{print $5}')
+ 2) is ignorant of keepalived and may bind to a keepalived ip.
+ 
+ But I am not sure the problems are limited to this script. There could be other places with
+ the same logic. Some details are here:
+ https://docs.openshift.com/container-platform/3.7/admin_guide/manage_nodes.html#manage-node-change-node-traffic-interface
+
+Hostnames
+=========
+ The Linux host name (uname -n) should match the host names assigned to OpenShift nodes. Otherwise, the certificate verification
+ will fail. It seems a minor issue as the system continues functioning, but it is better avoided. The check can be performed with etcd:
+    etcdctl3  --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints="192.168.213.1:2379,192.168.213.3:2379,192.168.213.4:2379"
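Independently of etcd, the host name match described in the 'Hostnames' section can be checked against the node list, as sketched below. The function name is hypothetical, and the sketch assumes that 'oc get nodes -o name' prints one 'nodes/<name>' (or 'node/<name>') entry per line.

```shell
#!/bin/sh
# Reads 'oc get nodes -o name' output on stdin and succeeds only if the given
# host name exactly matches one of the listed node names.
hostname_in_nodes() {
    sed 's|^.*/||' | grep -qxF -- "$1"
}
```

On each node this could be run as: oc get nodes -o name | hostname_in_nodes "$(uname -n)" || echo "host name does not match any node"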
diff --git a/docs/pods.txt b/docs/pods.txt
new file mode 100644
index 0000000..b84f42f
--- /dev/null
+++ b/docs/pods.txt
@@ -0,0 +1,13 @@
+Updating Daemon Set
+===================
+ - Not trivial. We need to 
+    a) Re-create the ds
+        * Manually change 'imagePullPolicy' to 'Always' if it is set to 'IfNotPresent'
+    b) Destroy all pods and allow the ds to recreate them
+
+ - Sample: Updating gluster
+    oc -n glusterfs delete ds/glusterfs-storage
+    oc -n glusterfs process glusterfs IMAGE_NAME=chsa/gluster-centos IMAGE_VERSION=312 > gluster.json
+        *** Edit
+    oc -n glusterfs create -f gluster.json
+    oc -n glusterfs delete pods -l 'glusterfs=storage-pod'
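The manual '*** Edit' step in the sample above could be replaced by a filter along these lines. The function name is hypothetical, and the sketch assumes the pull policy appears in the generated JSON as the value 'IfNotPresent' of an "imagePullPolicy" key; the result should still be inspected before it is created.

```shell
#!/bin/sh
# Rewrites imagePullPolicy from IfNotPresent to Always in the JSON read from
# stdin; everything else is passed through unchanged.
force_pull_always() {
    sed -E 's/("imagePullPolicy"[[:space:]]*:[[:space:]]*)"IfNotPresent"/\1"Always"/g'
}
```

With it, the second command of the sample becomes: oc -n glusterfs process glusterfs IMAGE_NAME=chsa/gluster-centos IMAGE_VERSION=312 | force_pull_always > gluster.json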
diff --git a/docs/regions.txt b/docs/regions.txt
new file mode 100644
index 0000000..88b8f5e
--- /dev/null
+++ b/docs/regions.txt
@@ -0,0 +1,16 @@
+region=infra            Infrastructure nodes which are used by OpenShift to run router and registry services. This is 
+                        more or less ipekatrin* nodes down in the basement.
+region=prod             Production servers (ipecompute*, etc.) located anywhere, but I expect only in the basement.
+region=dev              Temporary nodes
+
+zone=default            Basement
+zone=404                Second server room on 4th floor
+zone=student            Student room
+zone=external           Other external places
+
+
+
+production: 1           Specifies all production servers (no extra load, no occasional reboots)
+                        This includes 'infra' and 'prod' regions.
+server: 1               Like production, but with occasional reboots and some extra testing load possible
+permanent: 1            Non-production systems, but which are permanently connected to OpenShift
diff --git a/docs/samples/templates/00-katrin-restricted.yml.j2 b/docs/samples/templates/00-katrin-restricted.yml.j2
new file mode 100644
index 0000000..6221f30
--- /dev/null
+++ b/docs/samples/templates/00-katrin-restricted.yml.j2
@@ -0,0 +1,44 @@
+# Overriding SCC rules to allow arbitrary gluster mounts in restricted containers
+---
+allowHostDirVolumePlugin: false
+allowHostIPC: false
+allowHostNetwork: false
+allowHostPID: false
+allowHostPorts: false
+allowPrivilegedContainer: false
+allowedCapabilities: null
+apiVersion: v1
+defaultAddCapabilities: null
+fsGroup:
+  type: MustRunAs
+groups:
+- system:authenticated
+kind: SecurityContextConstraints
+metadata:
+  annotations:
+    kubernetes.io/description: restricted denies access to all host features and requires
+      pods to be run with a UID, and SELinux context that are allocated to the namespace.  This
+      is the most restrictive SCC.
+  creationTimestamp: null
+  name: katrin-restricted
+priority: null
+readOnlyRootFilesystem: false
+requiredDropCapabilities:
+- KILL
+- MKNOD
+- SYS_CHROOT
+- SETUID
+- SETGID
+runAsUser:
+  type: MustRunAsRange
+seLinuxContext:
+  type: MustRunAs
+supplementalGroups:
+  type: RunAsAny
+volumes:
+- glusterfs
+- configMap
+- downwardAPI
+- emptyDir
+- persistentVolumeClaim
+- secret
diff --git a/docs/samples/vars/run_oc.yml b/docs/samples/vars/run_oc.yml
new file mode 100644
index 0000000..a464549
--- /dev/null
+++ b/docs/samples/vars/run_oc.yml
@@ -0,0 +1,6 @@
+oc:
+  - template: "[0-3]*"
+  - template: "[4-6]*"
+  - resource: "route/apache" 
+    oc: "expose svc/kaas --name apache --hostname=apache.{{ openshift_master_default_subdomain }}"
+  - template: "*"
diff --git a/docs/samples/vars/variants.yml b/docs/samples/vars/variants.yml
new file mode 100644
index 0000000..c7a27b4
--- /dev/null
+++ b/docs/samples/vars/variants.yml
@@ -0,0 +1,33 @@
+# First port is exposed
+
+pods:
+  kaas:
+    variant: "{{ ands_prefer_docker | default(false) | ternary('docker', 'centos') }}"
+    centos:
+      service: { host: "{{ katrin_node }}", ports: [ 80/8080, 443/8043 ] }
+      sched: { replicas: 1, selector: { master: 1 } }
+      selector: { master: 1 }
+      images:
+        - image: "centos/httpd-24-centos7"
+          mappings: 
+            - { name: "etc", path: "apache2-kaas-centos", mount: "/etc/httpd" }
+            - { name: "www", path: "kaas", mount: "/opt/rh/httpd24/root/var/www/html" }
+            - { name: "log", path: "apache2-kaas", mount: "/var/log/httpd24" }
+          probes:
+            - { port: 8080, path: '/index.html' }
+    docker:
+      service: { host: "{{ katrin_node }}", ports: [ 80/8080, 443/8043 ] }
+      sched: { replicas: 1, selector: { master: 1 } }
+      selector: { master: 1 }
+      images:
+        - image: "httpd:2.2"
+          mappings: 
+            - { name: "etc", path: "apache2-kaas-docker", mount: "/usr/local/apache2/conf" }
+            - { name: "www", path: "kaas", mount: "/usr/local/apache2/htdocs" }
+            - { name: "log", path: "apache2-kaas", mount: "/usr/local/apache2/logs" }
+          probes:
+            - { port: 8080, path: '/index.html' }
+
+
+
+  
+\ No newline at end of file
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
new file mode 100644
index 0000000..b4ac8e7
--- /dev/null
+++ b/docs/troubleshooting.txt
@@ -0,0 +1,210 @@
+The services have to be running
+-------------------------------
+  Etcd:
+    - etcd 
+
+  Node:
+    - origin-node
+ 
+  Master nodes:
+    - origin-master-api
+    - origin-master-controllers
+    - origin-master is not running
+
+  Required Services:
+    - lvm2-lvmetad.socket 
+    - lvm2-lvmetad.service
+    - docker
+    - NetworkManager
+    - firewalld
+    - dnsmasq
+    - openvswitch
+ 
+  Extra Services:
+    - ssh
+    - ntp
+    - openvpn
+    - ganesha (on master nodes, optional)
+
+Pods have to be running
+-----------------------
+  Kubernetes System
+    - kube-service-catalog/apiserver
+    - kube-service-catalog/controller-manager
+  
+  OpenShift Main Services
+    - default/docker-registry
+    - default/registry-console
+    - default/router (3 replicas)
+    - openshift-template-service-broker/api-server (daemonset, on all nodes)
+
+  OpenShift Secondary Services
+    - openshift-ansible-service-broker/asb
+    - openshift-ansible-service-broker/asb-etcd
+
+  GlusterFS
+     - glusterfs-storage (daemonset, on all storage nodes)
+     - glusterblock-storage-provisioner-dc
+     - heketi-storage
+
+  Metrics (openshift-infra):
+    - hawkular-cassandra
+    - hawkular-metrics
+    - heapster
+    
+
+Debugging
+=========
+ - Ensure system consistency as explained in 'consistency.txt' (incomplete)
+ - Check current pod logs and possibly logs for last failed instance
+        oc logs <pod name> --tail=100 [-p]                  - dc/name or ds/name as well
+ - Verify initialization steps (check if all volumes are mounted)
+        oc describe <pod name>
+ - It is worth looking at the pod environment
+        oc env po <pod name> --list
+ - It is worth connecting to the running container with an 'rsh' session to see running processes,
+ internal logs, etc. The 'debug' session will start a new instance of the pod.
+ - Check whether the corresponding pv/pvc are bound. Check the logs for the pv.
+    * Even if the 'pvc' is bound, the 'pv' may have problems with its backend.
+    * Check logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/
+ - Another frequent problem is a failing 'postStart' hook or 'livenessProbe'. As the pod
+ immediately crashes, it is not possible to connect. Remedies are:
+    * Set larger initial delay to check the probe.
+    * Try to remove hook and execute it using 'rsh'/'debug'
+ - Determine node running the pod and check the host logs in '/var/log/messages'
+    * Particularly logs of 'origin-master-controllers' are of interest
+ - Check which docker images are actually downloaded on the node
+        docker images
+
+network
+=======
+ - There is a NetworkManager script which should adjust /etc/resolv.conf to use the local dnsmasq server.
+ This is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh' which does not play well 
+ if OpenShift is running on a non-default network interface. I provided a patched version, but it
+ is worth verifying 
+    * that nameserver is pointing to the host itself (but not localhost, this is important
+    to allow running pods to use it)
+    * that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+    * In some cases, it was necessary to restart dnsmasq (but it could be also for different reasons)
+ If the script misbehaves, it is possible to call it manually like this
+    DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
+
+
+etcd (and general operability)
+====
+ - A few of these services may seem to be running according to 'systemctl', but actually misbehave. Then, it 
+ may be necessary to restart them manually. I have noticed it with 
+    * lvm2-lvmetad.socket       (pvscan will complain about problems)
+    * origin-node
+    * etcd               but BEWARE of too enthusiastic restarting:
+ - However, restarting etcd many times is BAD as it may trigger a severe problem with 
+ 'kube-service-catalog/apiserver'. The bug description is here
+        https://github.com/kubernetes/kubernetes/issues/47131
+ - Due to the problem mentioned above, all 'oc' queries are very slow. No proper
+ solution has been suggested. But killing the 'kube-service-catalog/apiserver' helps for a while.
+ The pod is restarted and response times are back in order.
+    * Another way to see this problem is querying the 'healthz' service which would tell that
+    there are too many clients and, please, retry later.
+        curl -k https://apiserver.kube-service-catalog.svc/healthz
+
+ - On node crash, the etcd database may get corrupted. 
+    * There is no easy fix. Backup/restore is not working.
+    * Easiest option is to remove the failed etcd from the cluster.
+        etcdctl3 --endpoints="192.168.213.1:2379" member list
+        etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
+    * Add it to [new_etcd] section in inventory and run openshift-etcd to scale-up etcd cluster.
+ 
+ - There is a health check provided by the cluster
+    curl -k https://apiserver.kube-service-catalog.svc/healthz
+ it may complain about etcd problems. This seems to be triggered by the OpenShift upgrade. The real cause and
+ remedy are unclear, but the installation is mostly working. Discussion is in docs/upgrade.txt
+ 
+ - There is also a different etcd which is an integral part of the ansible service broker: 
+ 'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs' it complains 
+ about:
+        2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+        WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
+ Nevertheless, it seems to be working without much trouble. The error message seems to be caused by
+ certificate verification code which was introduced in etcd 3.2. There are multiple bug reports on
+ the issue.
+ 
+pods (failed pods, rogue namespaces, etc...)
+====
+ - After crashes / upgrades some pods may end up in 'Error' state. This quite often happens to
+    * kube-service-catalog/controller-manager
+    * openshift-template-service-broker/api-server
+ Normally, they should be deleted. Then, OpenShift will auto-restart the pods and they will likely run without problems.
+    for name in  $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }' ); do oc -n openshift-template-service-broker delete po $name; done
+    for name in  $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }' ); do oc -n kube-service-catalog delete po $name; done 
+ 
+ - Other pods will fail with 'ImagePullBackOff' after a cluster crash. The problem is that ImageStreams populated by 'builds' will 
+ not be recreated automatically. By default, the OpenShift docker registry is stored on ephemeral disks and is lost on crash. The build should be 
+ re-executed manually.
+        oc -n adei start-build adei
+
+ - Furthermore, after long outages the CronJobs will stop functioning. The reason can be found by analyzing '/var/log/messages' or specifically
+        systemctl status origin-master-controllers
+  it will contain something like:
+        'Cannot determine if <namespace>/<cronjob> needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.'
+  * The reason is that after 100 missed (or failed) launch periods the controller stops trying, in order to avoid excessive load. The remedy is to set 'startingDeadlineSeconds',
+  which tells the system that if the CronJob has failed to start within the allotted interval, we stop trying until the next start period. Then, the 100 misses are only 
+  counted within the specified deadline, i.e. the deadline should cover fewer than 100 launch periods.
+        https://github.com/kubernetes/kubernetes/issues/45825
+  * The running CronJobs can be easily patched with
+        oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'
+  
+ - Sometimes there is rogue namespaces in 'deleting' state. This is also hundreds of reasons, but mainly
+    * Crash of both masters during population / destruction of OpenShift resources
+    * Running of 'oc adm diagnostics'
+  It is unclear how to remove them manually, but it seems if we run
+    * OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems).
+    * ... i don't know if install, etc. May cause the trouble...
+
+ - There is also rogue pods (mainly due to some problems with unmounting lost storage), etc. If 'oc delete' does not
+ work for a long time. It worth
+    * Determining the host running failed pod with 'oc get pods -o wide'
+    * Going to the pod and killing processes and stopping the container using docker command
+    * Looking in the '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
+        - This can be done with "find . -name 'heketi*'" or something similar...
+        - There could be problematic mounts which can be freed with lazy umount
+        - The folders for removed pods may (and should) be removed.
+
+ - Looking into '/var/log/messages', it is sometimes possible to spot various errors like
+    * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
+        The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
+    * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
+        - We can find and remove the corresponding container (the short id is just first letters of the long id)
+                docker ps -a | grep aa28e9c76
+                docker rm <id>
+        - We can further just destroy all containers which are not running (it will actually try to remove all,
+        but only an error message will be printed for the running ones)
+                docker ps -aq --no-trunc | xargs docker rm
+
+
+Storage
+=======
+ - Running a lot of pods may exhaust the available storage. It is worth checking if
+    * There is enough Docker storage for containers (lvm)
+    * There is enough Heketi storage for dynamic volumes (lvm)
+    * The root file system on nodes still has space for logs, etc. 
+  In particular, there is a big problem with ansible-run virtual machines. The system disk is stored
+  under '/root/VirtualBox VMs' and, unlike the second hard drive, is not cleaned/destroyed on 'vagrant
+  destroy'. So, it should be cleaned manually.
+  
+ - Problems with pvc's can be evaluated by running 
+        oc  -n openshift-ansible-service-broker describe pvc etcd
+   Furthermore, it is worth looking in the folder with volume logs. For each 'pv' it stores subdirectories
+   for the pods executed on this host which mount this volume, and holds the logs for these pods.
+        /var/lib/origin/plugins/kubernetes.io/glusterfs/
+
+ - Heketi is problematic.
+    * Worth checking if topology is fine and running.
+        heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
+ - Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This can
+ be checked with 
+        gluster volume status <vol> detail
+    * If not all bricks are online, it is likely enough to just restart the volume
+        gluster volume stop <vol>
+        gluster volume start <vol>
+    * This may break services depending on provisioned 'pv' like 'openshift-ansible-service-broker/asb-etcd'
+    
diff --git a/docs/upgrade.txt b/docs/upgrade.txt
new file mode 100644
index 0000000..b4f22d6
--- /dev/null
+++ b/docs/upgrade.txt
@@ -0,0 +1,64 @@
+Upgrade
+-------
+ - The 'upgrade' may break things, causing long cluster outages, or may even require a complete re-install.
+ Currently, I have found a problem with 'kube-service-catalog', but I am not sure the problems are limited to it.
+ Furthermore, we are currently using the 'latest' tag of several docker images (heketi is an example of a critical
+ service on the 'latest' tag). An update may break things.
+ 
+kube-service-catalog
+--------------------
+ - Update of 'kube-service-catalog' breaks OpenShift health check
+        curl -k https://apiserver.kube-service-catalog.svc/healthz
+ It complains about 'etcd'. The specific etcd check
+    curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
+ reports that all servers are unreachable.
+ 
+ - In fact, etcd is working and the cluster is mostly functional. Occasionally, it may suffer from the bug
+ described here:
+        https://github.com/kubernetes/kubernetes/issues/47131
+ The 'oc' queries are extremely slow and the healthz service reports that there are too many connections.
+ Killing the 'kube-service-catalog/apiserver' helps for a while, but the problem returns occasionally.
+
+ - The information below is an attempt to understand the reason. In fact, it is a list of what
+ is NOT the reason. The only solution found is to prevent the update of 'kube-service-catalog' by setting
+         openshift_enable_service_catalog: false
+
+ - The problem only occurs if the 'openshift_service_catalog' role is executed. It results in some
+ miscommunication between 'apiserver' and/or 'controller-manager' and etcd. Still, the cluster is
+ operational, so the connection is not completely lost, but it does not work as expected in some
+ circumstances.
+
+ - There are no significant changes. Exactly the same docker images are installed. The only change in
+ '/etc' is the updated certificates used by 'apiserver' and 'controller-manager'.
+    * The certificates are located in '/etc/origin/service-catalog/' on the first master server.
+    'oc adm ca' is used for generation. However, the certificates in this folder are not used directly. They
+    are merely temporary files used to generate 'secrets/service-catalog-ssl', which is used by
+    'apiserver' and 'controller-manager'. The provisioning code is in:
+        openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
+    It can't be disabled completely, as the registered 'apiserver_ca' variable is used in install.yml, but
+    the actual generation can be skipped and the old files re-used to generate the secret.
+    * I have tried to modify the role to keep the old certificates. The healthz check was still broken afterwards.
+    So, the certificate update is not the problem (or at least not the sole problem).
+ 
+ - The 'etcd' cluster seems OK. On all nodes, the etcd can be verified using
+            etcdctl3 member list
+    * The last command is actually a bash alias which executes
+        ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
+    Actually, etcd serves two ports: 2379 (clients) and 2380 (peers). One idea was that maybe the
+    second port has problems. I tried changing 2379 to 2380 in the command above and it failed.
+    However, it does not work either when the cluster is in a healthy state.
+    * One idea was that the certificates are re-generated for wrong IPs/names and, hence, certificate validation
+    fails. Or that the originally generated CA is registered with etcd. This is certainly not the (only) issue,
+    as the problem persists even if we keep the certificates intact. However, I also verified that the newly generated
+    certificates are completely similar to the old ones and contain the correct hostnames.
+    * The last idea was that 'asb-etcd' is actually broken. It complains
+        2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+    However, the same error is present in log directly after install while the cluster is completely 
+    healthy.
+    
+ - The networking also does not seem to be an issue. The configurations during install and upgrade are exactly the same.
+ All names are defined in /etc/hosts. Furthermore, the names in /etc/hosts are resolved (and back-resolved)
+ by the provided dnsmasq server. I.e. ipeshift1 resolves to 192.168.13.1 using nslookup and 192.168.13.1 resolves
+ back to ipeshift1. So, the configuration is indistinguishable from a proper one with properly configured DNS.
+ 
+  
+\ No newline at end of file