K8s pods failing liveness probes

Feature(s) impacted

Observed behavior

We are having issues with our FA pods failing liveness probes. We are seeing numerous pod restarts daily because of this. I’m wondering if there’s anything characteristic of FA backends that could cause this to happen - or maybe our liveness expectations are simply too high…?

Expected behavior

Failure Logs

We aren’t really seeing anything unusual in the logs - there is usually an elevated number of errors around the time of the probe failure, but the errors themselves are just normal, mostly user-input-related errors.

Context

  • Project name: scrathc-payment-service
  • Team name: All
  • Environment name: staging, staging-p2, production
  • Agent (forest package) name & version: "forest-express-sequelize": "^9.3.9"
  • Database type: mysql & postgres
  • Recent changes made on your end if any: We are in the process of an infra migration, which is what brought this to our attention, but we are seeing it in the old cluster as well and it has been happening for a long time - we had just never had it brought to our attention. We have long had issues with FA stalling and needing a refresh, or timing out on fetch requests. We had always blamed it on queries against un-indexed data; better indexing and query optimization has helped our cause, but we still see fetch failures, and the failing pods seem to be the issue.
ports:
  - containerPort: 8080
    name: http
    protocol: TCP
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
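
For reference, if our liveness expectations really are too high, the knobs we could loosen look something like this - the numbers below are only placeholders, not our real values:

livenessProbe:
  initialDelaySeconds: 30   # give the agent time to boot before the first check
  periodSeconds: 20
  timeoutSeconds: 15
  failureThreshold: 5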

Also suggested by my OPS counterpart: here’s our Dockerfile - if you have any feedback there…

FROM node:20-alpine

RUN apk add --update supervisor && rm -rf /tmp/* /var/cache/apk/*

ADD docker-config-files/supervisord.conf /etc/

WORKDIR /var/www/sps-fa.scratchpay.com

COPY package*.json ./

ENV NPM_CONFIG_LOGLEVEL warn
RUN npm set progress=false
RUN npm config set registry https://registry.npmjs.org/
RUN npm update -g npm

RUN npm install lumber-cli -g -s
RUN npm install -s --production

COPY . .

EXPOSE 8080

ENTRYPOINT ["npm", "start"]

We’ve continued to do more research on this and it appears to be caused by “Reached heap limit Allocation failed - JavaScript heap out of memory”. (This error wasn’t being logged for some reason on our old deployment, but we’re seeing it now with much better infra all around, which is giving us better logging.)

Has anyone else seen this issue, or is it more likely some sort of memory leak? The symptoms have existed for quite some time, across many versions of both node and forest-express-sequelize.

Hello Brett,
Sorry for the late response.
Are you still blocked?
Do you have the same error in a development environment, or is it only related to k8s with staging and prod?

Hey, @Alban_Bertolini!

Sorry for the delay in response!

We have “fixed” the Reached heap limit Allocation failed - JavaScript heap out of memory issue. We increased --max-old-space-size in the node options and also increased the memory allocation for the pods. The pods are not coming even close to using all of the memory and the node error is gone.
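
For reference, the change boiled down to something like this on the container spec - the numbers here are illustrative rather than our exact values:

env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=2048"   # raise the V8 old-space heap limit
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "3Gi"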

The pods only crash on authentication. Once a user is authenticated, it’s flawless, but when someone logs in or changes teams, a pod crashes and restarts. On the user side, this often results in what appears to be a failed login, but then you refresh the page and all is well. (Though I wasn’t able to reproduce this while writing this.)

Error log:

➜ k exec -it forest-admin-rc-web-5b46f4475-n724x /bin/sh
~/.npm/_logs # tail -f *

==> 2024-01-18T20_09_22_674Z-debug-0.log <==
22 verbose argv "start"
23 timing npm:load:setTitle Completed in 2ms
24 timing npm:load:display Completed in 1ms
25 verbose logfile logs-max:10 dir:/root/.npm/_logs/2024-01-18T20_09_22_674Z-
26 verbose logfile /root/.npm/_logs/2024-01-18T20_09_22_674Z-debug-0.log
27 timing npm:load:logFile Completed in 23ms
28 timing npm:load:timers Completed in 0ms
29 timing npm:load:configScope Completed in 0ms
30 timing npm:load Completed in 157ms
31 silly logfile done cleaning log files
command terminated with exit code 137
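
Side note: exit code 137 is a SIGKILL, which on Kubernetes usually means the container was OOMKilled or was killed after failing its liveness probe. The recorded termination reason is visible in the pod description, e.g.:

kubectl describe pod forest-admin-rc-web-5b46f4475-n724x | grep -A 5 "Last State"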

Hello,
Thank you for the feedback :pray:
Do you know why the agent only crashes when trying to authenticate? Did you find more logs?
Do you still have this problem?

I’ve provided all of the logs we get from the pod in my most recent post.

Do you know why the agent only crashes when trying to authenticate?
No clue. We were hoping there may be something on your end that you could check. AFAIK, we don’t do anything to manage authentication in our code.

Do you still have this problem?
Yes.

The login does succeed but the pod that was hit to process the authentication crashes. Front end just does nothing (infinite load). Then you can refresh the front end, it connects to a different pod, and everything proceeds as normal.

You probably have a running agent configured for another environment, which is why your authentication sometimes doesn’t work.
When the authentication doesn’t work, can you send your client-id? You can find it in your browser’s network tab, in the query params of the auth request.

The authentication does work, though. It just crashes the pod.

We hit the auth route. Nothing happens in the UI (eternal load). Refresh. We’re logged in and everything loads.

Dm’d client_id

Hello again,
Did you try using “/forest” instead of “/” as the path to check the liveness of your pod?
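
If it helps, a sketch of what that could look like, assuming the agent is reachable on the same port (the thresholds are just a starting point, not a recommendation):

livenessProbe:
  httpGet:
    path: /forest
    port: http
    scheme: HTTP
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3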