System
------
2025.09.28
- ipekatrin1:
 * RAID controller doesn't see 10 disks and behaves erratically.
 * Turned off the server and ordered a replacement.
- Storage:
 * Restarted the degraded GlusterFS nodes and made everything work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
 * Turned out the 'database' volume was created in RAID-0 mode and served as the backend for the KDB database. So, the data is gone.
 * Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume; it can be turned off.
2025.10.27
- ipekatrin1:
 * Disconnected all disks from the server and started preparing it as an application node.
- Software:
 * Temporarily suspended all ADEI cronJobs [clean (logs, etc.) / maintain (re-caching, etc.) / update (detecting new databases)] to avoid resource contention on ipekatrin2 (a restart would be dangerous now); see the sketch below.
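   For reference, suspending/resuming can be scripted (a sketch; the 'adei' namespace name is an assumption):
   $ for cj in $(oc -n adei get cronjobs -o name); do oc -n adei patch $cj -p '{"spec": {"suspend": true}}'; done
   (patch with "suspend": false to resume)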
- Research:
 * The GlusterFS DaemonSet selects nodes based on the following nodeSelector:
$ oc -n glusterfs get ds glusterfs-storage -o yaml | grep -B 5 -A 5 nodeSelector
nodeSelector:
glusterfs: storage-host
   All nodes have the corresponding label in their metadata:
$ oc get node/ipekatrin1.ipe.kit.edu --show-labels -o yaml | grep -A 20 labels:
labels:
...
glusterfs: storage-host
...
 * That label is now removed from ipekatrin1 and should be restored if we bring the storage back:
   $ oc label --dry-run node/ipekatrin1.ipe.kit.edu glusterfs-
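   Restoring it later (the value matches the nodeSelector above):
   $ oc label node/ipekatrin1.ipe.kit.edu glusterfs=storage-host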
 * We further need to remove 192.168.12.1 from 'endpoints/gfs' (per namespace) to avoid possible problems; see the sketch below.
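   A sketch for finding the affected namespaces (the endpoints name 'gfs' is from above; editing is then manual):
   $ for ns in $(oc get ns -o name | cut -d/ -f2); do ips=$(oc -n $ns get endpoints gfs -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null); [ -n "$ips" ] && echo "$ns: $ips"; done
   $ oc -n <namespace> edit endpoints gfs    # drop 192.168.12.1 from the addresses list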
 * On ipekatrin1, the GlusterFS mounts in /etc/fstab should be changed from 'localhost' to the remaining servers (or commented out altogether), e.g.
   192.168.12.2,192.168.12.3:<vol> /mnt/vol glusterfs defaults,_netdev 0 0
 * All RAID volumes should also be temporarily commented out in /etc/fstab.
 * Further configuration changes are required to run the node without glusterfs while causing no damage to the rest of the system; see the audit sketch below.
   GlusterFS might be referenced via: /etc/hosts, /etc/fstab, /etc/systemd/system/*.mount, /etc/auto.*, scripts/cron,
   endpoints (per namespace), inline gluster volumes in PVs (global),
   gluster-block endpoints / tcmu gateway list, sc (heketi storageclass) and controllers (ds, deploy, sts); just in case, check heketi cm/secrets.
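   A quick audit sketch for the host-side and cluster-side places above (assumes 'jq' is available):
   $ grep -ril gluster /etc/hosts /etc/fstab /etc/auto.* /etc/systemd/system /etc/cron* 2>/dev/null
   $ oc get pv -o json | jq -r '.items[] | select(.spec.glusterfs != null) | .metadata.name'
   $ oc get sc    # look for glusterfs/heketi provisioners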
- Plan:
* Prepare application node [double-check before implementing]
  > Adjust /etc/fstab and check systemd-based mounts. Shall we do something with /etc/hosts?
  > Check/change cron & monitoring scripts.
  > Adjust the node label and edit the 'gfs' endpoints in all namespaces.
  > Check glusterblock/heketi and strange PVs.
  > Google for other possible culprits beyond the list above.
  > Boot ipekatrin1 and check that everything is fine.
* cronJobs
  > Set affinity to ipekatrin1 (see the sketch below).
  > Restart the cronJobs (maybe reduce the intervals).
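    Pinning a job could look like this (a sketch; the 'adei' namespace and cronJob names are assumptions, and the kubernetes.io/hostname label is assumed to match the node name):
    $ oc -n adei patch cronjob <name> -p '{"spec": {"jobTemplate": {"spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/hostname": "ipekatrin1.ipe.kit.edu"}}}}}}}'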
* ToDo
  > Ideally, eliminate cronJobs altogether for the rest of the KaaS1 life-time and replace them with a continuously running cron daemon inside the container.
  > Rebuild ipekatrinbackupserv1 as a new gluster node (using the disks) and try connecting it to the cluster; a rough sketch follows.
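    Reattaching a rebuilt node might look roughly like this (a sketch; the hostname and brick paths are assumptions, and the exact procedure depends on whether bricks are replaced or added):
    $ gluster peer probe ipekatrinbackupserv1.ipe.kit.edu
    $ gluster volume replace-brick <vol> 192.168.12.1:<brick-path> ipekatrinbackupserv1.ipe.kit.edu:<brick-path> commit force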
Hardware
--------
2024
- ipekatrin1: Replaced the disk in slot 9. The LSI software reports all is OK, but the hardware LED indicates an error (red). Probably the indicator is broken.
2025.09 (early in the month)
- ipekatrin1: Replaced 3 disks (don't remember the slots); two of them had already been replaced once.
- Ordered spare disks
2025.10.23
- ipekatrin1:
 * Replaced the RAID controller. Attempted a rebuild, but the disks get disconnected after about 30-40 minutes (they recover after a shutoff, not after a reboot).
 * Checked for power issues: cabling bypassing the PSU and monitoring voltages (the 12V rail should not go below 11.9V). No change; the voltages seemed fine.
 * Checked for cabling issues by disconnecting first one cable and then the other (a supported mode; a single cable connects all disks). No change.
 * Tried to improve cooling by setting the fan speeds to maximum (kept) and even temporarily installing an external cooler. The heatsinks were cool, and the reported temperatures were also checked. No change; the disks still go down after 30-40 minutes.
 * Suspect backplane problems. The heatsinks were quite hot before adjusting the cooling. There seem to be known stability problems due to bad signal management in the firmware when overheated; firmware updates are suggested to stabilize it.
 * No support from SuperMicro. Queried Tootlec about the possibility of getting a firmware update and/or ordering a backplane [Order RG_014523_001_Chilingaryan from 16.12.2016, quote (Angebot): 14.10, contract: 28.11].
   Hardware: chassis CSE-846BE2C-R1K28B, backplane BPN-SAS3-846EL2, 2x MCX353A-FCB ConnectX-3 VPI
 * KATRINBackupServ1 (3 years older) has a backplane with enough bays to mount the disks. We still need to be able to fit the RAID card and the Mellanox ConnectX-3 board/boards with 2 ports (can live with 1).
- ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
 * No apparent problems at the moment. Temperatures are all in order. The battery reports healthy. The system works as usual.
 * Set up temperature monitoring of the RAID card; currently 76-77°C (see the sketch below).
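   A minimal monitoring sketch (assuming a MegaRAID controller managed via storcli64; the controller index /c0 is an assumption):
   $ storcli64 /c0 show all | grep -i "ROC temperature"
   Run periodically from cron and appended to a log file, this gives a simple temperature history.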
Software
--------
2023.06.13
- Instructed the MySQL slave to ignore 1062 (duplicate key) errors as well (I had skipped a few manually, but the errors kept appearing non-stop); see the config fragment below.
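  Presumably this maps to the following my.cnf fragment (slave-skip-errors is not a dynamic variable, so it requires a server restart):
  [mysqld]
  slave-skip-errors = 1062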
- Also, the ADEI-KATRIN pod got stuck. The pod was running, but apache was stuck and not replying. This caused the pod state to report 'not-ready', yet for some reason it was still considered 'live', so the pod was not restarted; a liveness probe sketch follows.
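  A liveness probe against apache should restart the pod in this situation (a sketch; the 'adei' namespace, the 'adei-katrin' dc name, and port 80 are assumptions):
  $ oc -n adei set probe dc/adei-katrin --liveness --get-url=http://:80/ --initial-delay-seconds=60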