Lessons learned from our recent outage
In an effort to turn challenges into strengths, we want to share the lessons we learned from the recent incident that kept our authentication service unavailable for a few hours. While disruptive, the event gave us valuable insights and a clear path toward strengthening our systems and practices. In this post, we outline what we learned and the steps we are taking to improve our resilience against future attacks.
The Incident
User data was not affected by this incident. The attack only impacted the authentication service, causing a service interruption.
On 2024-02-26, we noticed a surge in traffic hitting our authentication service, the shared service that allows users to log in and sign up to our products. The surge was substantial: up to 8,000 times the traffic we normally sustain.
A DDoS attack had been launched against our service, leading to a substantial service interruption. This was one of the longest outages in our history, which is why we decided to publicly share more details about it. The outage impacted login and signup, while already logged-in users could still access the platform.
It took us barely ten minutes to identify the root cause of the vulnerability: our cloud networking protection service, a critical component of our defense, had not been properly configured for the new authentication service.
Our Response
Our team responded to the attack with urgency, deploying several strategies to mitigate its impact. We scaled up our infrastructure, implemented geographic blocking to manage traffic, and engaged closely with our cloud provider, Google Cloud, to restore the necessary protective measures. These actions were crucial in containing the attack and eventually restoring normal service levels.
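For illustration, geographic blocking of the kind mentioned above can also be expressed at the application edge. The sketch below is not our actual setup (in our case the blocking was applied through the cloud provider's tooling), and the header name and country list are placeholders for the example.

```typescript
// Illustrative sketch only: block requests from selected regions during an
// attack, assuming the load balancer or CDN injects the caller's country
// code as a header. Header name and country codes are placeholders.
const BLOCKED_COUNTRIES = new Set(["XX", "YY"]);

export function geoBlock(request: Request): Response | null {
  const country = request.headers.get("x-client-region") ?? "UNKNOWN";
  if (BLOCKED_COUNTRIES.has(country)) {
    // Short-circuit before the request reaches the normal handler.
    return new Response("Temporarily unavailable in your region", { status: 403 });
  }
  return null; // let the request through
}
```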
However, the incident highlighted areas for improvement, particularly in our observability and our mastery of the technologies in use. Compounding the missing networking protection was our initial lack of awareness of, and observability into, the specific behaviors and limitations of our authentication service, in particular our authentication frontend, a Remix-based server-side rendered application. Remix is a wonderful web framework that we are adopting for a number of applications and will keep investing in.
It turned out that our authentication frontend could not properly handle the rate limiting strategy of our API, because it enforced a further rate limiting mechanism built into our API client. We had not spotted this behavior while load testing under reasonable traffic conditions, which were still far below the traffic we experienced during the attack. As a result, even though we scaled the service horizontally to absorb the surge and eventually mitigated the attack, bringing traffic back to a reasonable level, the conflicting mechanisms kept the service from recovering for a few hours.
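To make the failure mode concrete, here is a hypothetical TypeScript sketch (not our actual client) of how a client-side throttle with retry-on-429 behavior, layered on top of an API's own rate limiting, can keep a frontend degraded even after upstream traffic subsides: once the API starts answering 429, retries re-enter the client queue and the backlog grows faster than it drains.

```typescript
// Hypothetical API client with its own concurrency cap and retry-on-429
// logic, unaware of the server's rate limiting. Under a surge, the two
// mechanisms compound instead of cooperating.
type Task<T> = () => Promise<T>;

class ThrottledApiClient {
  private queue: Array<() => void> = [];
  private active = 0;

  constructor(
    private baseUrl: string,
    private maxConcurrent = 4,   // client-side limit, independent of the API's limit
    private retryDelayMs = 500,
  ) {}

  // Queue a task and run it when a concurrency slot frees up.
  private schedule<T>(task: Task<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = async () => {
        this.active++;
        try {
          resolve(await task());
        } catch (err) {
          reject(err);
        } finally {
          this.active--;
          this.next();
        }
      };
      this.queue.push(run);
      this.next();
    });
  }

  private next(): void {
    if (this.active < this.maxConcurrent && this.queue.length > 0) {
      this.queue.shift()!();
    }
  }

  async get(path: string): Promise<Response> {
    const res = await this.schedule(() => fetch(`${this.baseUrl}${path}`));
    if (res.status === 429) {
      // The API is already shedding load; retrying re-enters the queue and
      // compounds the server-side limit instead of surfacing the error.
      await new Promise((r) => setTimeout(r, this.retryDelayMs));
      return this.get(path);
    }
    return res;
  }
}
```

Under normal load this pattern behaves fine, which is exactly why load tests at reasonable traffic levels did not surface it.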
As mentioned above, it took us a long time to identify the rate limiting problem because we lacked proper observability into the service, including adequate logging and tracing that would have let us follow the lifecycle of requests and understand the unexpected behavior of the authentication service. We first employed Remix at scale last August, when we released our Shared Login. This experience is giving us actionable feedback on how we adopt and master new technologies, as well as on enforcing well-established practices across the stack.
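As an example of the kind of observability that would have shortened the diagnosis, the sketch below wraps outgoing requests with a request id and a structured log line. The header name and log shape are assumptions for illustration, not our production setup.

```typescript
import { randomUUID } from "node:crypto";

// Minimal sketch: emit one structured log line per outgoing request so its
// lifecycle can be followed across the frontend, the API client and the API.
async function tracedFetch(url: string, init: RequestInit = {}): Promise<Response> {
  const requestId = randomUUID();
  const startedAt = Date.now();

  // Propagate the request id so logs from different services can be correlated.
  const headers = new Headers(init.headers);
  headers.set("x-request-id", requestId);

  try {
    const res = await fetch(url, { ...init, headers });
    console.log(
      JSON.stringify({
        requestId,
        url,
        status: res.status,
        rateLimited: res.status === 429,
        durationMs: Date.now() - startedAt,
      }),
    );
    return res;
  } catch (err) {
    console.error(
      JSON.stringify({ requestId, url, error: String(err), durationMs: Date.now() - startedAt }),
    );
    throw err;
  }
}
```

With a log line like this for every request, conflicting rate limiting would show up as a growing stream of `rateLimited: true` entries rather than a silent backlog.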
Moving Forward
Reflecting on this incident, we recognize the importance of continuous learning and adaptation. Here are the key steps we are implementing to fortify our platform:
- Enhanced Communication: We are improving our internal communication processes to ensure that critical information is shared promptly and effectively across all relevant teams
- Comprehensive Service Acceptance Criteria: We will enforce adherence to a rigorous set of service acceptance criteria for all backend and frontend services, including enhanced observability measures. This will enable us to detect and address potential issues more proactively
- Improved Infrastructure and Testing: We are committed to upgrading our infrastructure to better handle increased traffic volumes and potential threats. This includes expanding our testing strategies to cover a wider range of scenarios, ensuring our platform remains resilient under various conditions
- Status Page Overhaul: Transparency with our users is paramount. We are committed to improving our status page to offer a more detailed, real-time overview of our services' status, providing users with timely information about service interruptions and maintenance updates
The recent attack was a stark reminder of the evolving challenges in maintaining a secure and reliable platform. While it was a difficult time, it has catalyzed our efforts to strengthen our systems, processes, and team readiness.
At Toggl, we are committed to learning from this incident and emerging stronger, ensuring that we continue to provide the reliable service our users expect.