Timeouts from the Authentication API

Hi.

All our ForestAdmin instances were timing out from Aug 31st 23:30 CEST to Sep 1st 00:40 CEST. I was able to replicate the issue locally and confirm it was not coming from our setup. ForestAdmin was effectively down for us during that window.

The errors raised by the forest middleware (coming from openid-client) suggest that some of your authentication endpoints weren't responding quickly enough during that window. I'm assuming these were your /oidc endpoints. Has anyone else experienced this issue?

If this is confirmed, could someone from the ForestAdmin team please explain what happened and what is measured by the Authentication API service status check? Happy to help investigate the issue further (e.g. with more precise timestamps) if needed.
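In case it helps you correlate on your side, this is the kind of probe I can run from our network to collect more precise timestamps. It is only a sketch: the exact discovery path under /oidc is an assumption on my part, and the 3.5s request timeout mirrors what got reports in the logs below.

// probe.js - minimal sketch, assuming a public OIDC discovery document under /oidc
// (the exact URL is a guess) and using got v11 with the same 3500ms request timeout.
const got = require('got');

const DISCOVERY_URL = 'https://api.forestadmin.com/oidc/.well-known/openid-configuration'; // hypothetical path

(async () => {
  const started = Date.now();
  try {
    await got(DISCOVERY_URL, { timeout: { request: 3500 } });
    console.log(new Date().toISOString(), `OK in ${Date.now() - started}ms`);
  } catch (error) {
    // A TimeoutError here reproduces the "Timeout awaiting 'request' for 3500ms" failures below.
    console.error(new Date().toISOString(), error.name, error.code);
  }
})();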

Failure Logs

[forest] 🌳🌳🌳  Unexpected error: Timeout awaiting 'request' for 3500ms
{
  "name": "TimeoutError",
  "code": "ETIMEDOUT",
  "timings": {
    "start": 1630449325255,
    "socket": 1630449325255,
    "lookup": 1630449325256,
    "error": 1630449328773,
    "phases": {
      "wait": 0,
      "dns": 1,
      "total": 3518
    }
  },
  "event": "request",
  "stack": "RequestError: Timeout awaiting 'request' for 3500ms\n    at ClientRequest.<anonymous>"
}

Context

{
  "liana": "forest-express-sequelize",
  "liana_version": "7.12.2",
  "stack": {
    "database_type": "postgres",
    "engine": "nodejs",
    "engine_version": "14.17.5",
    "orm_version": "5.22.4"
  }
}

Hello @ridem,

Thanks for all the details. Do you know which endpoint on your agent experienced these timeouts?

Hello,

I believe it was mostly the callback ones: /authentication/callback?..., i.e. the routes that end up calling your authentication endpoints back.

How many timeouts did you experience?

Basically everything was timing out during that timeframe: the 99th-percentile response time hit 30s, which was our configured maximum. I was able to confirm the issue locally at the time as well.

Could it be that you were affected by the incident reported on Heroku Status?

Thanks for pointing out this Heroku issue. I checked our logs, and at that time the server was responding very quickly to the requests it was actually receiving.

I can see an error on one of our systems that uses the same authentication process as your agents, and the problem does not appear in our backend, so it is most likely related to a connectivity issue (either our backend did not receive the request, or your agent did not receive the response).

I'll investigate why this problem does not appear on the status page, as we check the authentication process exactly the way your agents use it.

Ok, thanks for looking into this!

I was getting these errors locally too, on my personal internet connection, which is why I was surprised to see the status check stay green there as well. Didn't Better Uptime even capture elevated response times? openid-client seems to configure got to throw for anything above 3.5s.
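For reference, that 3.5s threshold matches openid-client's default, and openid-client does expose a hook to change the got defaults it uses for its own HTTP calls. A minimal sketch, assuming openid-client v4's custom.setHttpOptionsDefaults API; whether forest-express lets us call this before it builds its client is an assumption on my side:

// sketch only: raise openid-client's HTTP timeout above its 3500ms default
const { custom } = require('openid-client');

custom.setHttpOptionsDefaults({
  timeout: 10000, // per-request timeout in ms, applied to got requests made by openid-client
});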

Both /forest/authentication/callback? and /forest/Model? timed out. I just looked at the logs: the first timeout was at Aug 31 23:35:56, the last one at Sep 01 00:35:32. Did you notice a drop in throughput during that timeframe?

In case this helps, these are the two types of errors thrown by forest during that period:

[forest] 🌳🌳🌳  Unexpected error: Cannot read property 'body' of undefined
{
  "stack": "TypeError: Cannot read property 'body' of undefined\n    at AuthorizationFinder._callee$ (/app/node_modules/forest-express/dist/services/authorization-finder.js:112:242)\n    at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:63:40)\n    at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:294:22)\n    at Generator.throw (/app/node_modules/regenerator-runtime/runtime.js:119:21)\n    at asyncGeneratorStep (/app/node_modules/forest-express/node_modules/@babel/runtime/helpers/asyncToGenerator.js:3:24)\n    at _throw (/app/node_modules/forest-express/node_modules/@babel/runtime/helpers/asyncToGenerator.js:29:9)"
}
[forest] 🌳🌳🌳  Unexpected error: Timeout awaiting 'request' for 3500ms
{
  "name": "TimeoutError",
  "code": "ETIMEDOUT",
  "timings": {
    "start": 1630447473321,
    "socket": 1630447473322,
    "lookup": 1630447473349,
    "error": 1630447476823,
    "phases": {
      "wait": 1,
      "dns": 27,
      "total": 3502
    }
  },
  "event": "request",
  "stack": "RequestError: Timeout awaiting 'request' for 3500ms\n    at ClientRequest.<anonymous> (/app/node_modules/got/dist/source/core/index.js:956:65)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:412:35)\n    at ClientRequest.emit (domain.js:470:12)\n    at ClientRequest.origin.emit (/app/node_modules/@szmarczak/http-timer/dist/source/index.js:39:20)\n    at TLSSocket.socketErrorListener (_http_client.js:475:9)\n    at TLSSocket.emit (events.js:400:28)\n    at TLSSocket.emit (domain.js:470:12)\n    at emitErrorNT (internal/streams/destroy.js:106:8)\n    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)\n    at Timeout.timeoutHandler [as _onTimeout] (/app/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (internal/timers.js:559:11)\n    at processTimers (internal/timers.js:500:7)"
}

Here is the result of our investigation:

Between 21:20 UTC and 22:50 UTC, our API experienced connectivity issues caused by incidents on Heroku.

Depending on the moment, between 25% and 50% of requests timed out because they could not reach our servers.

Our monitoring tools, visible at https://status.forestadmin.com, recorded intermittent timeouts, but the rule we used to detect incidents did not allow them to surface on the status page.

To avoid such detection issues in the future, we have changed the configuration so that incidents are detected more accurately.


Thanks a lot for the thorough investigation Guillaume!