As a Forest Admin customer, you are likely to have been affected by the incidents that occurred on October 8, 9, and 13.
Forest Admin is deeply sorry for the impact it may have had on your operations and business.
Such events are very rare, but when they happen they hurt every part of the chain (hosting provider, service providers, end users). Beyond apologies, we owe you explanations and a description of the actions we are taking.
Incidents Summary
On October 8-9, Forest Admin experienced two similar incidents caused directly by its hosting provider (Heroku), for a total of 6 hours and 49 minutes of downtime and 7 hours and 34 minutes of degraded service.
On October 13, a planned, automated Heroku maintenance of one of our Redis instances, run on Sunday morning (at ~4:00am UTC+2), led to errors on the US area platform (the new platform the technical team had set up in emergency mode on October 9 to recover faster than Heroku). It took Forest Admin 4 hours and 21 minutes to acknowledge the incident (at 9:24am UTC+2) and a further 34 minutes to resume the service. This added ~4 hours and 55 minutes of downtime and 3 minutes of degraded service.
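The October 13 figures can be cross-checked against the timestamps given in this report (errors starting at 5:03am, acknowledgement at 9:24am, service resumed at 9:58am, all UTC+2). The sketch below is only an arithmetic check; the year 2024 is taken from the Heroku notice quoted later in this report, and the helper name fmt is ours.

```python
from datetime import datetime, timedelta, timezone

# Timestamps as stated in this report, all in UTC+2 on October 13, 2024.
TZ = timezone(timedelta(hours=2))
errors_started = datetime(2024, 10, 13, 5, 3, tzinfo=TZ)
acknowledged = datetime(2024, 10, 13, 9, 24, tzinfo=TZ)
resumed = datetime(2024, 10, 13, 9, 58, tzinfo=TZ)


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes, e.g. 4h21."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}"


time_to_acknowledge = acknowledged - errors_started
time_to_resume = resumed - acknowledged
total_impact = resumed - errors_started

print(fmt(time_to_acknowledge))  # 4h21
print(fmt(time_to_resume))       # 0h34
print(fmt(total_impact))         # 4h55
```

The time to acknowledge and the time to resume add up to the ~4 hours and 55 minutes reported above.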
Incidents Details
The first two incidents were caused by routing failures at Forest Admin’s hosting provider (Heroku).
On October 8, during the first incident, the Forest Admin technical team took a few minutes to investigate whether the incident was caused by recent changes on the Forest Admin side. Nothing relevant was found, so we quickly suspected our hosting provider. In the meantime, the Forest Admin technical team did its best to communicate as transparently as possible on the progress of the incident. Once Heroku publicly acknowledged their incident, the Forest Admin technical team monitored Heroku’s status page while looking for ways to bypass the issue. Heroku finally resumed their services after 5 hours and 28 minutes of downtime:
- Forest Admin report of incident #1 (October 8)
On October 9, the Forest Admin technical team detected the recurrence of the incident about 1 hour before Heroku did. Since the first incident had apparently only impacted the EU area, the team immediately decided to deploy the Forest Admin services in the US area. This was a good move to resume the service quickly without relying solely on Heroku’s response time.
However, DNS propagation takes time (the api.forestadmin.com record had a historical TTL of 3 hours), which delayed the restoration of the services:
- Forest Admin report of incident #2 (October 9)
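To illustrate why the TTL mattered: a resolver that cached the old DNS record just before the switch may keep serving it for up to one full TTL. The sketch below compares the 3-hour TTL in place on October 9 with the 5-minute TTL configured since then. The switch time used here is hypothetical, and the bound assumes resolvers honor the TTL, which not all of them do.

```python
from datetime import datetime, timedelta


def latest_stale_until(record_changed_at: datetime, ttl: timedelta) -> datetime:
    """Upper bound on when a TTL-respecting resolver stops serving the old record.

    Worst case: the resolver cached the old answer just before the change,
    so it keeps serving that answer for one full TTL.
    """
    return record_changed_at + ttl


# Hypothetical switch time, for illustration only (not taken from the report).
changed_at = datetime(2024, 10, 9, 12, 0)

for label, ttl in [
    ("3-hour TTL", timedelta(hours=3)),
    ("5-minute TTL", timedelta(minutes=5)),
]:
    print(label, "->", latest_stale_until(changed_at, ttl).strftime("%H:%M"))
```

With a 3-hour TTL, some clients could still be directed to the failing area hours after the record was changed; with a 5-minute TTL, that window shrinks to minutes.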
As a temporary measure, the platform remained hosted in the US area over the weekend (with the historical platform, hosted in the EU area, turned off).
On October 13, a new incident started, as a side effect of the migration to the US area (the temporary emergency configuration set up on October 9). A planned, automated Heroku maintenance upgraded one of our Redis instances without updating the related environment variable, which should have contained the new Redis URI. This led to errors on the Forest Admin Authentication API. The errors started early in the morning (5:03am UTC+2), mainly impacting Asian customers, and the service resumed at 9:58am UTC+2 after switching back to the initial EU area platform:
- Forest Admin report of incident #3 (October 13)
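One generic way to catch this kind of drift between a configured Redis URI and the actual instance is a periodic reachability check on the URI the application is configured with. The sketch below is an illustration only, not a description of Forest Admin's monitoring: the function name and the REDIS_URL variable name are assumptions, and a plain TCP check verifies neither authentication nor that the endpoint is the intended instance.

```python
import os
import socket
from urllib.parse import urlparse


def redis_endpoint_reachable(uri: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the host/port in `uri` succeeds."""
    parsed = urlparse(uri)
    if not parsed.hostname:
        return False
    try:
        port = parsed.port or 6379  # 6379 is the default Redis port
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: connection refused, timeout, DNS failure.
        # ValueError: malformed port in the URI.
        return False


if __name__ == "__main__":
    # REDIS_URL is an assumed variable name, used here for illustration.
    uri = os.environ.get("REDIS_URL", "")
    if not redis_endpoint_reachable(uri):
        print("ALERT: configured Redis URI is not reachable")
```

Run on a schedule and wired to an alerting channel, a check of this kind could shorten the time between the first errors and their acknowledgement.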
Since Sunday, October 13, at 9:58am UTC+2, platform uptime has been 100%.
Current Situation
The Forest Admin platform is now back in the EU area. The technical team now has the ability to switch, within a few minutes, from one Heroku area to the other, in case of issues in a specific area (we have reduced the TTL to 5 minutes to be able to act quickly and switch in case of other similar emergencies).
This procedure is documented internally so that the team can proceed efficiently should a similar incident occur again.
At this stage we are still entirely dependent on Heroku’s stability, and such downtime/degraded service times, without explanations, are clearly not acceptable.
Root Cause Analysis
Regarding the root cause of these three incidents, Forest Admin is still waiting for official communication from our hosting platform, Heroku.
At this stage, it’s not clear what led to the downtime in the Heroku EU area (where Forest Admin servers were hosted). The only thing Forest Admin knows, via a scheduled email communication received on October 10, is that Heroku had planned maintenance related to network traffic performance:
Starting October 10, 2024, we are adding global edge network capabilities to the Heroku Common Runtime to improve network traffic performance for all regions. This change will improve app performance for all Common Runtime customers.
This change will be automatically applied and requires no setup on your part, and it will update the default public IP addresses for all Common Runtime apps.
Forest Admin suspects that the incidents were unexpected side effects of this maintenance.
Forest Admin is in contact with the Heroku support team and follows up on a daily basis to obtain their RCA (Root Cause Analysis). We expect answers concerning:
- the origin of the initial issue,
- the issue replica the day after,
- the very slow reaction time (2 days in a row),
- the really poor emergency communication.
Forest Admin also expects a commitment from Heroku to improve:
- their internal process to prevent similar issues in the future,
- their monitoring systems to detect issues before their own customers,
- the emergency support official communication (frequency, quality, transparency).
Prevention for the future
As mentioned above, Forest Admin implemented and documented a procedure to switch, within a few minutes, from one Heroku area to the other, in case of issues in a specific area.
This will make our technical team far more efficient at resuming the service should another similar incident occur on the Heroku platform. This was probably the first, and the most significant, incident of this kind in more than 8 years of collaboration.
Now that the emergency mode is over, Forest Admin is considering how its hosting should evolve.
The main ideas are:
- switching to a more established hosting service,
- implementing a platform failover strategy,
- considering an external audit of our platform, from a stability perspective.
That being said, nothing has been decided yet, so Forest Admin cannot yet communicate a clear direction for the coming months.
In the meantime, Forest Admin is currently in the process of becoming SOC2 compliant (including a Pentest finalized in September 2024).
The Forest Admin technical team will include the switch procedure from the EU area to the US one (mentioned above) in our “disaster & recovery plan” (required for SOC2 compliance).
A review of this process (among others) is scheduled by our SOC2 auditors by mid-December 2024.
Quality
The quality of service (availability, performance, stability) has always been extremely important to Forest Admin. We know that our platform is essential to hundreds of companies for their business-critical daily operations.
Despite these recent events, we feel it’s important to provide a broader view of our uptime:
- Last 365 days: 99.9306% (~6 hours of downtime)
- All time uptime (1281 days): 99.9609% (~12 hours of downtime)
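For readers who want to check how these percentages translate into hours, here is a minimal conversion sketch. The function name is ours, and it assumes each period is measured in full 24-hour days.

```python
def downtime_hours(uptime_percent: float, days: int) -> float:
    """Convert an uptime percentage over a period of `days` into hours of downtime."""
    return (1 - uptime_percent / 100) * days * 24


print(round(downtime_hours(99.9306, 365), 1))   # 6.1
print(round(downtime_hours(99.9609, 1281), 1))  # 12.0
```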
Rest assured that we continue to aim for an uptime of 100%.
Our team remains fully available to answer any additional questions or to discuss how our collaboration should evolve.
Thank you for your patience and support during these difficult times.
Related Incident Reports
- Heroku report of incident #1 (October 8)
- Forest Admin report of incident #1 (October 8)
- Heroku report of incident #2 (October 9)
- Forest Admin report of incident #2 (October 9)
- Forest Admin report of incident #3 (October 13)