OpenShift Infra Node Stuck at NotReady State


Yesterday, I encountered an annoying problem in my OpenShift Cluster. I was trying to deploy an application. Everything was fine until OpenShift tried to push the image generated at the end of the build to the OpenShift registry.

Below is how I described the problem to Red Hat support:

I just ran my Jenkins pipeline and everything succeeded until I hit this problem. OpenShift failed to push the created image to the docker registry 6 times, then stopped trying. When I checked the default namespace, I saw that pod creation for docker-registry was failing. It says “no nodes are available that match all of the following predicates:: CheckServiceAffinity (2), MatchNodeSelector (2)”. I checked my 2 compute nodes and there are no disk space problems. I checked my infra node too, no problems there either. This automated deployment pipeline was working about 3 weeks ago, there has been no change since then, and I don’t see why the registry pod is failing now. I am attaching some screenshots and the output of the df command on the infra node. Also, I don’t know why, but although the dc says 1 replica, this pod keeps trying to scale up to 2 and then scaling back down.

It turned out that the problem originated from “openvswitch.service”, which could not start. That in turn was caused by “ovs-vswitchd.service”. In short, “ovs-vswitchd.service” kept timing out while trying to start, and journalctl eventually showed the message “ovs-vswitchd.service start operation timed out. Terminating.” With Open vSwitch down, the infra node was stuck in NotReady state, so the registry pod could not be scheduled on it.
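To trace a failure like this, one option is to pull the journal for the unit and filter for systemd's timeout message. The following is a minimal sketch rather than the exact commands used here; `find_start_timeouts` is a hypothetical helper name, and it matches only the message text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print only the lines
# where systemd reports that a unit's start operation timed out.
find_start_timeouts() {
    grep -F 'start operation timed out'
}

# Assumed usage on the affected node (not executed here):
#   systemctl status openvswitch.service ovs-vswitchd.service
#   journalctl -u ovs-vswitchd.service --no-pager | find_start_timeouts
```

If the filter prints nothing, the unit is probably failing for some other reason and the full journal output is worth reading.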

I did some searching and digging around. The solution I came up with was to add a TimeoutSec value to both “/usr/lib/systemd/system/openvswitch.service” and “/usr/lib/systemd/system/ovs-vswitchd.service”. Restarting them with the “systemctl restart <service_name>” command finally got those services into the active (running) state.
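One caveat: files under /usr/lib/systemd/system belong to the installed package, so an update may overwrite the edit. A common systemd alternative, which is not the approach described above, is a drop-in file under /etc/systemd/system/<unit>.d/. Here is a minimal sketch, assuming the same 300-second value; the function name `write_timeout_dropin` and the file name `timeout.conf` are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets TimeoutSec for a unit.
#   $1 = base directory (normally /etc/systemd/system)
#   $2 = unit name, e.g. ovs-vswitchd.service
#   $3 = timeout in seconds
write_timeout_dropin() {
    dropin_dir="$1/$2.d"
    mkdir -p "$dropin_dir" || return 1
    # timeout.conf is an arbitrary name; systemd reads every *.conf file
    # found in the unit's drop-in directory.
    printf '[Service]\nTimeoutSec=%s\n' "$3" > "$dropin_dir/timeout.conf"
}

# Assumed usage as root (not executed here):
#   write_timeout_dropin /etc/systemd/system ovs-vswitchd.service 300
#   systemctl daemon-reload
#   systemctl restart ovs-vswitchd.service
```

The `systemctl daemon-reload` step makes systemd re-read its unit files; it applies after a drop-in and after a direct edit alike.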

After rebooting the infra node, waiting for a couple of minutes, and running “oc get nodes” on the master node I got this:

tybsrhosinode01.defence.local Ready 189d v1.6.1+5115d708d7
tybsrhosmaster01.defence.local Ready,SchedulingDisabled 189d v1.6.1+5115d708d7
tybsrhosnode01.defence.local Ready 189d v1.6.1+5115d708d7
tybsrhosnode02.defence.local Ready 189d v1.6.1+5115d708d7

Every node reports Ready, so all is good now!
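While waiting for a node to come back, a small filter over the “oc get nodes” output can list anything that is not yet Ready. This is a minimal sketch; `not_ready_nodes` is a hypothetical helper name, and it assumes the column layout shown above (name, status, age, version), with an optional NAME header line.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get nodes` output on stdin and print the names
# of nodes whose status column does not start with "Ready".
# "Ready,SchedulingDisabled" counts as ready; "NotReady" does not.
not_ready_nodes() {
    awk '$1 != "NAME" && NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Assumed usage on the master node (not executed here):
#   oc get nodes | not_ready_nodes
```

Empty output means every listed node is Ready.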

For reference, here are the two unit files after adding TimeoutSec:

[[email protected] ~]# cat /usr/lib/systemd/system/ovs-vswitchd.service
[Unit]
Description=Open vSwitch Forwarding Unit
After=ovsdb-server.service network-pre.target systemd-udev-settle.service
Before=network.target network.service
Requires=ovsdb-server.service
ReloadPropagatedFrom=ovsdb-server.service
AssertPathIsReadWrite=/var/run/openvswitch/db.sock
PartOf=openvswitch.service

[Service]
Type=forking
Restart=on-failure
TimeoutSec=300
EnvironmentFile=-/etc/sysconfig/openvswitch
ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \
--no-ovsdb-server --no-monitor --system-id=random \
start $OPTIONS
ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop
ExecReload=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server \
--no-monitor --system-id=random \
restart $OPTIONS

 

[[email protected] ~]# cat /usr/lib/systemd/system/openvswitch.service
[Unit]
Description=Open vSwitch
Before=network.target network.service
After=network-pre.target ovsdb-server.service ovs-vswitchd.service
PartOf=network.target
Requires=ovsdb-server.service
Requires=ovs-vswitchd.service

[Service]
Type=oneshot
TimeoutSec=300
ExecStart=/bin/true
ExecReload=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Hope this helps.
Good Luck,
Serdar