Updates from June, 2018

  • Serdar Osman Onur 7:19 am on June 21, 2018 Permalink | Reply

    OpenShift POD in CrashLoopBackOff State

    *OpenShift V3.6
    From time to time, PODs in an OCP cluster can get stuck in the CrashLoopBackOff state. There are various reasons for this; here I will talk about one exceptional case of being stuck in this CrashLoopBackOff state.

    I opened a support ticket about this and had a remote session to solve the problem together with a Red Hat support engineer.

    The thing was, somehow, at some point, for an unknown reason (network issues, proxy issues, etc.), this exceptional state was created and the node that this pod was being scheduled to had not received the COMPLETE IMAGE for this deployment. There was a missing layer! Once that missing layer was manually pulled on the failing NODE, the problem was gone and the POD was up & running again.

    There are 2 things to be done after SSHing into the target NODE.
    1- Log in to the DOCKER REGISTRY:
    docker login -u admin -p $(oc whoami -t) docker-registry.default.svc:5000

    2- Manually pull the image:
    docker pull docker-registry.default.svc:5000/tybsdev/yazi-sablon-arayuz

    In step 2 you will see the missing layer being pulled from the registry.
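    If it is not obvious which image the node is missing, something like this helps to find it before step 1 and to verify the pull afterwards (a rough sketch; <pod-name> and <project> are placeholders):

    # Which image (and tag) does the failing pod reference?
    oc get pod <pod-name> -n <project> -o jsonpath='{.spec.containers[*].image}'

    # After step 2: is the image present on the node, and are all of its layers listed?
    docker images | grep yazi-sablon-arayuz
    docker history docker-registry.default.svc:5000/tybsdev/yazi-sablon-arayuz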

     
  • Serdar Osman Onur 12:17 pm on June 6, 2018 Permalink | Reply

    OpenShift – Basic Deployment Operations 

    Starting a deployment: (start a deployment manually)

    Viewing a deployment: (get basic information about all the available revisions of your application)

    Canceling a deployment: (cancel a running or stuck deployment process)

    Retrying a deployment: (retry a failed deployment)

    Rolling back a deployment: (if no revision is specified with --to-revision, the last successfully deployed revision is used; see the command sketch below)

     

    https://docs.openshift.com/container-platform/3.6/dev_guide/deployments/basic_deployment_operations.html
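    For quick reference, the corresponding oc commands from the linked documentation look like this (dc/my-app is a placeholder deployment configuration):

    oc rollout latest dc/my-app                # start a deployment manually
    oc rollout history dc/my-app               # view the available revisions of your application
    oc rollout cancel dc/my-app                # cancel a running or stuck deployment process
    oc rollout retry dc/my-app                 # retry a failed deployment
    oc rollout undo dc/my-app                  # roll back to the last successfully deployed revision
    oc rollout undo dc/my-app --to-revision=1  # roll back to a specific revision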

     
  • Serdar Osman Onur 6:33 am on June 1, 2018 Permalink | Reply

    OpenShift – All Compute Nodes are in NotReady State 

    I am having a problem with my cluster. I have 2 compute nodes and neither of them is working.
    When I do "oc describe node <node_name>" I get the attached outputs for the 2 nodes.

    In the Events part the following caught my attention:

    1)
    NODE1:
    Type: Warning
    Reason: ContainerGCFailed
    Message: rpc error: code = 4 desc = context deadline exceeded

    2)
    NODE2:
    Reason: SystemOOM
    Message: System OOM Encountered

    Reason: ImageGCFailed
    Message: unable to find data for container

    Below are the “describe” outputs from the master for both compute nodes.

    describe-node1-28.05.2018

    describe-node2-28.05.2018

    I also attached the “sos reports” for both nodes. Below is the answer from Red Hat support.

    After some investigation, I figured it was a docker service problem caused by limited RAM resources. The problem was fixed by increasing the RAM for both compute nodes.

     
    • Serdar Osman Onur 6:34 am on June 1, 2018 Permalink | Reply

      Red Hat Support:

      Thank you for contacting Red Hat Support.

      I can see the below messages in the logs:

      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: E0528 00:31:22.071759 101303 kuberuntime_manager.go:619] createPodSandbox for pod "fikri-hak-yonetimi-1-mv312_tybsdev(34e7c705-61f5-11e8-a82a-0050569897ab)" failed: rpc error: code = 4 desc = context deadline exceeded
      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: E0528 00:31:22.071794 101303 pod_workers.go:182] Error syncing pod 34e7c705-61f5-11e8-a82a-0050569897ab ("fikri-hak-yonetimi-1-mv312_tybsdev(34e7c705-61f5-11e8-a82a-0050569897ab)"), skipping: failed to "CreatePodSandbox" for "fikri-hak-yonetimi-1-mv312_tybsdev(34e7c705-61f5-11e8-a82a-0050569897ab)" with CreatePodSandboxError: "CreatePodSandbox for pod \"fikri-hak-yonetimi-1-mv312_tybsdev(34e7c705-61f5-11e8-a82a-0050569897ab)\" failed: rpc error: code = 4 desc = context deadline exceeded"
      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: E0528 00:31:22.128803 101303 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: E0528 00:31:22.128875 101303 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "dosya-yonetimi-arayuz-1-zwjbz_tybsdev(331ac283-61f5-11e8-a82a-0050569897ab)" failed: rpc error: code = 4 desc = context deadline exceeded
      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: E0528 00:31:22.128893 101303 kuberuntime_manager.go:619] createPodSandbox for pod "dosya-yonetimi-arayuz-1-zwjbz_tybsdev(331ac283-61f5-11e8-a82a-0050569897ab)" failed: rpc error: code = 4 desc = context deadline exceeded
      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: E0528 00:31:22.128929 101303 pod_workers.go:182] Error syncing pod 331ac283-61f5-11e8-a82a-0050569897ab ("dosya-yonetimi-arayuz-1-zwjbz_tybsdev(331ac283-61f5-11e8-a82a-0050569897ab)"), skipping: failed to "CreatePodSandbox" for "dosya-yonetimi-arayuz-1-zwjbz_tybsdev(331ac283-61f5-11e8-a82a-0050569897ab)" with CreatePodSandboxError: "CreatePodSandbox for pod \"dosya-yonetimi-arayuz-1-zwjbz_tybsdev(331ac283-61f5-11e8-a82a-0050569897ab)\" failed: rpc error: code = 4 desc = context deadline exceeded"
      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: I0528 00:31:22.833265 101303 kubelet_node_status.go:410] Recording NodeNotReady event message for node tybsrhosnode02.defence.local

      May 28 00:31:22 tybsrhosnode02.defence.local atomic-openshift-node[101303]: I0528 00:31:22.833303 101303 kubelet_node_status.go:717] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-05-28 00:31:22.833243851 +0200 EET LastTransitionTime:2018-05-28 00:31:22.833243851 +0200 EET Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m9.360485732s ago; threshold is 3m0s}

      It looks like your docker is either using too many resources or is in a hung state when this issue is observed.

      My recommendation would be to upgrade to the latest docker 1.12 version.

      Also, when you see the issue again, check whether you are able to run docker ps or any other docker commands on the system.

      In the meantime, provide the output of the command below to check whether the docker socket returns data:

      curl --unix-socket /var/run/docker.sock http:/containers/json | python -mjson.tool

      If the python json.tool module is not installed, then:

      curl --unix-socket /var/run/docker.sock http:/containers/json

      # gcore -o /tmp/dockerd_core $(pidof dockerd-current)
      # gcore -o /tmp/containerd_core $(pidof docker-containerd-current)
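
      A minimal way to run the "docker ps" check mentioned above, with an arbitrary 30-second cutoff so a hung daemon is reported instead of blocking the shell:

      # Does the docker daemon answer within 30 seconds? A hang here matches the PLEG timeout in the log above.
      timeout 30 docker ps > /dev/null && echo "docker responsive" || echo "docker did not respond within 30s"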

      • Serdar Osman Onur 6:10 pm on June 11, 2018 Permalink | Reply

        In the end, the problems went away after I increased the RAM and CPU dedicated to my compute nodes. Then I asked this:

        Is there a guideline to handle such cases where a node becomes NotReady?

        Is there a list of first steps to take in such situations for a quick diagnosis?

        Another question: I feel like my nodes are consuming RAM aggressively, and RAM usage sometimes results in nodes being NotReady. How can I check if something is wrong with the RAM consumption of my nodes/pods?

        Red Hat Support
        -----
        I feel like my nodes are consuming RAM aggressively and RAM usage sometimes results in nodes being NotReady.
        -----

        -- Yes, this could be one of the reasons. Perhaps you can increase the RAM of the node according to your needs.

        -----
        Is there a list of first steps to take in such situations for a quick diagnosis?
        -----
        >> 1. The docker, atomic-openshift-node and dnsmasq daemons must be running. If any of these services fails, the node can turn NotReady.
        >> 2. Also, the DNS configuration should be correctly in place. For this, I am attaching one article[1].
        >> 3. Ensure that disk and memory pressure are within limits.

        -- You can also limit the number of pods scheduled on the node. Also, configuring limit ranges[2] would help you manage resource utilization efficiently.

        -----
        How can I check if something is wrong with the RAM consumption of my nodes/pods?
        -----
        -- There is no concrete method, but if the limit ranges are configured properly then the pods will never try to exceed their limits. Also, set the replica count for pods only as needed.

        [1] https://access.redhat.com/solutions/3343511
        [2] https://docs.openshift.com/container-platform/3.6/admin_guide/limits.html#overview
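
        Putting that advice together, the quick first checks on a NotReady node would look something like this (a rough sketch based on the points above, not an official checklist; paths and service names are the OCP 3.6 defaults):

        # 1. Are the critical node services running? (on the affected node)
        systemctl status docker atomic-openshift-node dnsmasq

        # 2. Does the docker daemon still answer?
        docker ps > /dev/null && echo "docker responsive"

        # 3. Memory and disk pressure
        free -h
        df -h /var/lib/docker /var/lib/origin

        # 4. From a master: what conditions and events does the node report?
        oc describe node <node_name>

        # 5. Recent errors from the node service
        journalctl -u atomic-openshift-node --since "1 hour ago" | tail -n 50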
