Delivering Toggl Shared Authentication
Last month, Toggl Hire became the last product to join the family of Toggl products using our shared authentication system. We finally provide a seamless experience across all Toggl services with a single set of credentials. Or, more accurately, we have taken the first step toward a seamless experience across all Toggl products. This is the story of the journey that led to this release.
Historically, our products operated fully independently, each with its own team. There were multiple reasons for this choice, and the approach allowed us to stay flexible and experiment independently for a long time, at different scales.
In January 2023, we started planning a significant leap forward in user experience: rolling out a unified authentication system, internally named Shared Auth. Shared Auth was also a giant project that, from the beginning, we knew would require complex coordination and the involvement of multiple teams.
Our Goal
The primary goal of the project was to get rid of the burden of managing multiple accounts for different Toggl products, providing a single access experience to all users. We decided to start from Track and Plan, our biggest products, and later allow the other products to integrate into Shared Auth.
A fundamental requirement was zero downtime, because at our scale it is really expensive to put the entire platform in maintenance mode and block user access. There was no room for mistakes: going live and offering the new functionality to all users required extreme coordination and had to account for all the edge cases that could arise from such a migration.
In order to allow users to log in and sign up with different methods, manage their password, and potentially their account, all of this through the same interface, we had to centralize all of our users into the same place.
Hence, the main technical challenge and goal was to aggregate users from different systems and databases into a single service to be consumed by all Toggl products. This goal had to be met taking into account synchronization required across different products using different cloud providers and different technology stacks.
Architecture
When designing an authentication system, there are a few questions to address before talking about any technology of choice. First of all, how is the session generated? How is the session stored, if at all? Which cryptographic method ensures confidentiality and integrity? And, in our case, how is the session transmitted to authenticate against multiple backend services?
Our products are served as subdomains of the root domain. For instance, Track is served at track.toggl.com, while Plan is served at plan.toggl.com. This is valid for both the frontend applications and the backend, which is reached on the same subdomain on a specific API path. This implied that a valid session must be available domain-wide at toggl.com, to be available to all of our applications.
In the browser, we soon figured out that a Secure cookie was the best way to store a domain-wide valid session. Technically, the Secure attribute protects only the cookie's confidentiality: an active network attacker can still overwrite Secure cookies from an insecure channel, disrupting their integrity. We didn’t consider this a problem for a few reasons: our traffic is always encrypted, we set the cookie as HTTP-only so that JavaScript has no access to it, and up-to-date browsers do not allow insecure (non-HTTPS) sites to set cookies with the Secure attribute.
On our native apps, we relied on OAuth with the PKCE extension to prevent CSRF and authorization code interception attacks. The OAuth setup returns both an access token with a one-hour lifetime and a refresh token with a one-month lifetime. The refresh token can be used to obtain a new pair of access and refresh tokens (meaning that we have implemented rotating refresh tokens for extra security). This setup guarantees that you can continue using the Toggl native apps without needing to log in again, as long as you interact with them at least once a month.
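To make the PKCE part concrete, here is a minimal sketch of how a client could generate the code verifier and its S256 code challenge, as defined by the PKCE specification (RFC 7636). The function names are hypothetical and this is not taken from the Toggl apps; it only illustrates the mechanism.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newVerifier returns a random code verifier: 32 random bytes encoded as
// unpadded base64url, which yields 43 characters.
func newVerifier() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// challengeS256 derives the code challenge sent with the authorization
// request: BASE64URL(SHA256(verifier)), without padding.
func challengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	verifier, err := newVerifier()
	if err != nil {
		panic(err)
	}
	// The app keeps the verifier private and sends only the challenge.
	// When exchanging the authorization code, it presents the verifier.
	fmt.Println("code_verifier: ", verifier)
	fmt.Println("code_challenge:", challengeS256(verifier))
}
```

Because only the app that started the flow knows the verifier, an intercepted authorization code cannot be exchanged for tokens by anyone else.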
Hence, we decided that our shared authentication platform would be served at accounts.toggl.com, which is exactly where users provide their credentials and where the secure cookie is set to be valid domain-wide.
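As an illustration, a domain-wide session cookie with the attributes discussed above could be built like this in Go. This is a sketch under assumptions: the cookie name and lifetime are made up, and only the Domain, Secure and HttpOnly settings come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newSessionCookie builds a session cookie that browsers send to every
// subdomain of toggl.com. The name "session" and the lifetime passed by the
// caller are assumptions for illustration only.
func newSessionCookie(token string, lifetime time.Duration) *http.Cookie {
	return &http.Cookie{
		Name:     "session", // hypothetical cookie name
		Value:    token,
		Domain:   "toggl.com", // valid for track.toggl.com, plan.toggl.com, ...
		Path:     "/",
		MaxAge:   int(lifetime.Seconds()),
		Secure:   true, // only sent over HTTPS
		HttpOnly: true, // not readable from JavaScript
	}
}

func main() {
	c := newSessionCookie("example-token", 24*time.Hour)
	// Print the value of the Set-Cookie header this cookie would produce.
	fmt.Println(c.String())
}
```

In a handler on accounts.toggl.com, the cookie would be attached to the response with `http.SetCookie(w, c)`.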
But what does the cookie contain? We explored the countless existing solutions, but eventually decided to use the widely adopted JWT (JSON Web Token) standard as the format for session credentials, so that Toggl products can validate them without having to make a synchronous request to the authentication service every time. Paired with JWT, we also employed JWKS (JSON Web Key Set) for sharing keys so that individual Toggl products could validate session credentials. A JWKS is a set of keys containing the public keys used to verify any JWT issued by the server. In our case, the signing is done with the Ed25519 algorithm, which provides a number of useful properties, such as small keys and signatures and fast verification.
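The following is a minimal sketch of how a JWT signed with Ed25519 can be issued and then verified using only the public key, with the Go standard library. The claim names (`sub`, `products`, `exp`) and function names are assumptions for illustration, not the actual Shared Auth token layout, and a production system would more likely rely on a maintained JWT library.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// Claims is a hypothetical session payload: identifiers only, no personal data.
type Claims struct {
	Subject  string   `json:"sub"`      // user ID
	Products []string `json:"products"` // hypothetical claim name
	Expiry   int64    `json:"exp"`      // Unix seconds
}

var b64 = base64.RawURLEncoding // JWTs use unpadded base64url

func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// sign produces header.payload.signature with the EdDSA algorithm.
func sign(priv ed25519.PrivateKey, c Claims) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	input := b64.EncodeToString([]byte(`{"alg":"EdDSA","typ":"JWT"}`)) + "." + b64.EncodeToString(payload)
	return input + "." + b64.EncodeToString(ed25519.Sign(priv, []byte(input))), nil
}

// verify checks the algorithm, the signature and the expiry, using only the
// public key, so no call to the issuing service is needed.
func verify(pub ed25519.PublicKey, token string, nowUnix int64) (Claims, error) {
	var claims Claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return claims, errors.New("malformed token")
	}
	headerJSON, err := b64.DecodeString(parts[0])
	if err != nil {
		return claims, errors.New("malformed header")
	}
	var header struct {
		Alg string `json:"alg"`
	}
	if err := json.Unmarshal(headerJSON, &header); err != nil || header.Alg != "EdDSA" {
		return claims, errors.New("unexpected algorithm")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil || !ed25519.Verify(pub, []byte(parts[0]+"."+parts[1]), sig) {
		return claims, errors.New("invalid signature")
	}
	payload, err := b64.DecodeString(parts[1])
	if err != nil {
		return claims, errors.New("malformed payload")
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return Claims{}, err
	}
	if nowUnix >= claims.Expiry {
		return Claims{}, errors.New("token expired")
	}
	return claims, nil
}

func main() {
	pub, priv, err := newKeyPair()
	if err != nil {
		panic(err)
	}
	token, err := sign(priv, Claims{
		Subject:  "42",
		Products: []string{"track", "plan"},
		Expiry:   time.Now().Add(time.Hour).Unix(),
	})
	if err != nil {
		panic(err)
	}
	claims, err := verify(pub, token, time.Now().Unix())
	fmt.Println(claims, err)
}
```

Pinning the accepted algorithm to EdDSA in `verify`, rather than trusting whatever the token header announces, avoids the classic algorithm-confusion pitfalls often cited against JWT.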
And yes, no need to point to tweets claiming how bad JWT is. With love and respect for cryptographers, we are aware of the risks and were careful to properly implement token verification, session storage and invalidation. We also don’t store any personal data, such as name or email, within the JWT.
After settling on the design elements of the authentication system, we were ready to choose the technologies to implement the solution.
On the frontend side, we decided to use Remix, the subject of one of our recent blog posts. Remix is a full stack web framework, specifically an SSR framework for React, that gained a lot of traction and popularity after its wide adoption by Shopify.
On the backend side, there was little doubt about using Go for the service and Postgres as the database, since this is the stack behind the vast majority of our services. We also leveraged PubSub, another widely used component in our architecture, for inter-backend communication, specifically to communicate user changes and revoked session events.
Development and Migration
The development and migration to Shared Auth involved meticulous coordination across multiple teams and products. Representatives from each Toggl team, including Track and Plan, collaborated closely to design, develop and deliver the solution. Overall, more than 10 teams were involved and almost 20 stakeholders took part in the project.
The setup described above was quite unusual for our company, but required by the size of the project. We named this group of stakeholders the Auth Taskforce. Stakeholders joined from engineering, product, design and management departments of both Track and Plan products, composing a fully cross-functional team able to manage the entire lifecycle of the project.
A dedicated Slack channel was created for the project in order to facilitate alignment and focused discussions. The team also held weekly meetings with rotating hosts, to foster ownership and comprehensive understanding among team members.
Each team representative was responsible for defining the testing flows required to validate the correctness of the solution. The testing flows included signup and logout via 4 different authentication methods within 10 different applications, as well as account edit and closure functionalities across both Track and Plan. This extensive testing ensured that all potential issues were identified and addressed before the full rollout. For instance, representatives from native clients verified that the OAuth flow was working correctly and that they could sign up, log out, close their account, and so on. On top of that, each representative also actively took part in testing general flows cross-platform.
The taskforce started setting up a dedicated staging environment for the new shared auth system. Track and Plan backends were also provisioned in this environment, incrementally adapting to the new system. In the same way all clients were able to target this dedicated environment and test the new flows.
Challenges
As mentioned above, we decided to start with the two main Toggl products by usage metrics. The two products have different technology stacks, both frontend and backend. They had been operating as separate companies until shortly before this project started, which carried the additional challenge of a lack of cooperation culture between the product teams.
In addition to their backends, both had native and webapp clients that needed to be adapted to work with the new model. We also defined some constraints for the delivery:
- We had to avoid logging everyone out at once on the release date (to reduce disruption to end users), instead allowing a transition period where old session credentials would still be recognized as valid
- We could not affect the functionality of profile and session endpoints for some months after rollout, so we transparently proxied their requests to the new system. This was to reduce disruption to third-party scripts and integrations that depended on the old endpoints
Another source of edge cases was users with matching emails in both products. We wanted to migrate the source of truth for user accounts from the individual Track and Plan databases to the new unified authentication system in the most transparent way, and we had to take into account the roughly 20,000 users present in both products. The tricky part is that both Track and Plan allow users to start using the product without verifying their email, hence we didn’t know beforehand whether users with matching emails represented the same person.
We introduced a solution we named “Merge Conflict Resolution”: each time one of those users tried to log in to the new system, they were presented with a screen explaining that a new unified auth system had been put in place and asking them to use a link we had sent to their email to set a new password. On top of making users aware of this change, it had two other benefits: first, we could verify the user’s ownership of the email; second, we could let the user choose whether they wanted to merge the two accounts in the first place. As an alternative to clicking the link in the email, users also had the option to change the email of the account they were in, in case it wasn’t valid, they couldn’t access it, or they simply didn’t want the two accounts connected.
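Before such a flow can be shown, the accounts with matching emails have to be identified. A minimal sketch of that step could look like the following; the normalization rule (trimming and lowercasing) and the function names are assumptions, not a description of how Track and Plan actually compared emails.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize applies an assumed comparison rule: ignore surrounding
// whitespace and letter case.
func normalize(email string) string {
	return strings.ToLower(strings.TrimSpace(email))
}

// matchingEmails returns the normalized emails present in both lists,
// each reported once, in sorted order.
func matchingEmails(track, plan []string) []string {
	inTrack := make(map[string]bool, len(track))
	for _, e := range track {
		inTrack[normalize(e)] = true
	}
	seen := make(map[string]bool)
	var matches []string
	for _, e := range plan {
		n := normalize(e)
		if inTrack[n] && !seen[n] {
			seen[n] = true
			matches = append(matches, n)
		}
	}
	sort.Strings(matches)
	return matches
}

func main() {
	track := []string{"Ana@Example.com", "bob@example.com"}
	plan := []string{"ana@example.com", "carl@example.com"}
	// Accounts in this list would go through Merge Conflict Resolution,
	// since a matching email alone does not prove it is the same person.
	fmt.Println(matchingEmails(track, plan))
}
```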
Release
Zero Downtime Strategy
The requirements for implementing a zero downtime release were:
- Implement transparent proxying of endpoints in both the Track and Plan backends so that they would resolve the request by contacting the new endpoints in Shared Auth
- Allow a transition period where the old session credential would still be recognized
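The first requirement, combined with a configured transition time, can be sketched as a handler that keeps serving the legacy implementation until the cutover and proxies to the new service afterwards. This is a simplified illustration using Go's standard reverse proxy; the handler names, the `/me` path and the cutover date are made up, and the real Track and Plan backends may have implemented proxying differently.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"time"
)

// transitionProxy serves requests with the legacy handler before the cutover
// time and transparently proxies them to sharedAuth from then on.
func transitionProxy(legacy http.Handler, sharedAuth *url.URL, cutover time.Time, now func() time.Time) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(sharedAuth)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if now().Before(cutover) {
			legacy.ServeHTTP(w, r)
			return
		}
		proxy.ServeHTTP(w, r)
	})
}

// demo calls the same endpoint before and after the cutover, using a local
// test server as a stand-in for Shared Auth.
func demo() (before, after string, err error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "shared-auth")
	}))
	defer upstream.Close()

	target, err := url.Parse(upstream.URL)
	if err != nil {
		return "", "", err
	}
	legacy := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "legacy")
	})

	cutover := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC) // made-up transition time
	clock := cutover.Add(-time.Minute)
	h := transitionProxy(legacy, target, cutover, func() time.Time { return clock })

	call := func() string {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/me", nil))
		return rec.Body.String()
	}
	before = call()
	clock = cutover // the transition time has been reached
	after = call()
	return before, after, nil
}

func main() {
	before, after, err := demo()
	fmt.Println(before, after, err)
}
```

Clients keep calling the same URL throughout; only the component that answers changes, which is what makes the switch invisible to third-party integrations.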
In the new system, which uses JWTs, the ID of each user and the products they have access to are encoded in the token. This way, when either the Track or Plan backend needed to validate the new session credentials, it only had to check whether:
- The product was included in the list encoded in the token
- The user ID was known, otherwise make a one-time request to Shared Auth
- The JWT was properly signed with the public keys, based on the JWK mechanism.
Those JWK public keys are cached by Track and Plan, so only one request to Shared Auth every 30 minutes is required to check whether they have changed.
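A key cache of this kind can be sketched as follows. The `fetch` function is a hypothetical stand-in for the HTTP request that downloads the key set from Shared Auth, and serving stale keys when a refresh fails is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const refreshInterval = 30 * time.Minute

// KeySet maps a key ID ("kid") to raw public key bytes.
type KeySet map[string][]byte

// KeyCache keeps the last fetched key set and refreshes it once it is older
// than ttl.
type KeyCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	fetch     func() (KeySet, error) // hypothetical: downloads the key set
	now       func() time.Time
	keys      KeySet
	fetchedAt time.Time
}

func NewKeyCache(ttl time.Duration, fetch func() (KeySet, error), now func() time.Time) *KeyCache {
	return &KeyCache{ttl: ttl, fetch: fetch, now: now}
}

// Key returns the public key for kid, contacting the key server only when
// the cache is empty or older than ttl.
func (c *KeyCache) Key(kid string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.keys == nil || c.now().Sub(c.fetchedAt) >= c.ttl {
		keys, err := c.fetch()
		switch {
		case err == nil:
			c.keys, c.fetchedAt = keys, c.now()
		case c.keys == nil:
			return nil, err // nothing cached yet, so the error must surface
		}
		// Otherwise keep serving the stale keys (assumption of this sketch).
	}
	key, ok := c.keys[kid]
	if !ok {
		return nil, errors.New("unknown key ID")
	}
	return key, nil
}

// fakeClock is a manually advanced clock used to demonstrate expiry.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	fetches := 0
	clock := &fakeClock{t: time.Now()}
	cache := NewKeyCache(refreshInterval, func() (KeySet, error) {
		fetches++
		return KeySet{"key-1": []byte("public-key-bytes")}, nil
	}, clock.Now)

	cache.Key("key-1")
	cache.Key("key-1") // served from the cache
	clock.Advance(refreshInterval)
	cache.Key("key-1") // triggers a refresh
	fmt.Println("fetches:", fetches)
}
```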
We also ensured the independence of Track and Plan, allowing them to process authentication for the vast majority of requests without making synchronous calls to the Shared Auth system. This is really important: it removes Shared Auth as a critical component that could drive down the uptime of all Toggl products if it were temporarily unavailable.
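The per-request checks listed above can be sketched as follows, assuming the token signature has already been verified. The type and function names are hypothetical; `lookup` stands in for the one-time request to Shared Auth for a user the product has not seen before.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrNoAccess = errors.New("token does not grant access to this product")

// SessionClaims holds the values decoded from an already verified token.
type SessionClaims struct {
	UserID   int64
	Products []string
}

// Authorizer performs the local checks for one product.
type Authorizer struct {
	product string
	lookup  func(userID int64) error // hypothetical one-time call to Shared Auth

	mu    sync.Mutex
	known map[int64]bool
}

func NewAuthorizer(product string, lookup func(userID int64) error) *Authorizer {
	return &Authorizer{product: product, lookup: lookup, known: make(map[int64]bool)}
}

// Authorize checks product access locally and contacts Shared Auth only the
// first time a user ID is seen.
func (a *Authorizer) Authorize(c SessionClaims) error {
	allowed := false
	for _, p := range c.Products {
		if p == a.product {
			allowed = true
			break
		}
	}
	if !allowed {
		return ErrNoAccess
	}

	a.mu.Lock()
	defer a.mu.Unlock()
	if a.known[c.UserID] {
		return nil // common case: no network call at all
	}
	if err := a.lookup(c.UserID); err != nil {
		return err
	}
	a.known[c.UserID] = true
	return nil
}

func main() {
	lookups := 0
	auth := NewAuthorizer("track", func(userID int64) error {
		lookups++
		return nil
	})
	claims := SessionClaims{UserID: 7, Products: []string{"track", "plan"}}
	fmt.Println(auth.Authorize(claims), auth.Authorize(claims), "lookups:", lookups)
	fmt.Println(auth.Authorize(SessionClaims{UserID: 8, Products: []string{"plan"}}))
}
```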
Given that Track and Plan already had a local concept of a user, when we changed the source of truth for these users to the new authentication system we simply reused these local records as a cache of the new source of truth, updated by listening to a PubSub queue. This made for a surgical insertion of the interaction needed between the pre-existing backends and the new system: the rest of their codebases could continue working seamlessly, consuming the local records without worrying about the existence of a new system.
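A consumer that keeps local user records in sync could look roughly like this. The event shape, the event type names and the version-based ordering are all assumptions made for this sketch; the message transport itself is left out, and `Apply` would be called from whatever PubSub subscription handler the backend uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Event is a hypothetical message published by the authentication service.
type Event struct {
	Type      string `json:"type"` // "user.changed" or "session.revoked" (made-up names)
	UserID    int64  `json:"user_id"`
	Email     string `json:"email,omitempty"`
	RevokedAt int64  `json:"revoked_at,omitempty"` // Unix seconds
	Version   int64  `json:"version"`              // increases with every change to the user
}

// LocalUser is the product's cached copy of the user.
type LocalUser struct {
	Email     string
	RevokedAt int64
	Version   int64
}

// LocalUsers is the local cache that the rest of the codebase reads from.
type LocalUsers struct {
	mu   sync.Mutex
	byID map[int64]LocalUser
}

func NewLocalUsers() *LocalUsers {
	return &LocalUsers{byID: make(map[int64]LocalUser)}
}

// Apply updates the local record from one message. Duplicate or out-of-order
// messages are ignored by comparing versions.
func (s *LocalUsers) Apply(payload []byte) error {
	var ev Event
	if err := json.Unmarshal(payload, &ev); err != nil {
		return fmt.Errorf("decode event: %w", err)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	user := s.byID[ev.UserID]
	if ev.Version <= user.Version {
		return nil // already applied or stale
	}
	switch ev.Type {
	case "user.changed":
		user.Email = ev.Email
	case "session.revoked":
		user.RevokedAt = ev.RevokedAt
	default:
		return fmt.Errorf("unknown event type %q", ev.Type)
	}
	user.Version = ev.Version
	s.byID[ev.UserID] = user
	return nil
}

// Get returns the cached record for a user, if any.
func (s *LocalUsers) Get(id int64) (LocalUser, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	u, ok := s.byID[id]
	return u, ok
}

func main() {
	store := NewLocalUsers()
	err := store.Apply([]byte(`{"type":"user.changed","user_id":7,"email":"ana@example.com","version":1}`))
	user, ok := store.Get(7)
	fmt.Println(user, ok, err)
}
```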
Scheduled Steps
The taskforce defined 9 required steps to handle the migration:
- Transition Time: all stakeholders agreed on a specific time for changing the source of truth to Toggl Accounts
- Set transition time: the transition time was specified in the new and old system to determine when the new behavior should start
- Deploy Toggl Accounts: no new signups or interactions with existing accounts will be possible until the time defined in step 1
- Deploy Toggl Track: clean up any existing references to previous accounts from the database, and also set the Transition Time after which requests will start being proxied
- Clean up Toggl Accounts production database: manually delete all users created during the testing phase
- Wait for the Transition Time: after the previous step, Toggl Accounts accepts no new signups, migrates no accounts and plays no role until the defined time
- Monitoring: multiple resources including Grafana and Google Cloud dashboards and Sentry for errors
- Frontend release: greenlight for Track and Plan frontend to release their changes
- Monitoring (again): the same resources as above, making sure users were able to access through the new system and that the Merge Conflict Resolution mechanism was working properly
Outcome
The transition to Shared Auth was successfully executed with minimal disruption.
Given the amount of testing that went into this project and the fact that we had a representative from each team, we were able to cover the vast majority of scenarios well. Allowing a transition period for both old session credentials and old endpoints reduced the surface area for surprises.
On launch day we saw some problems with the old native apps when trying to authenticate with Google/Apple, because they were pointing to the old Track endpoint, which in turn was transparently proxying to Shared Auth. For this situation we decided simply to turn off the automatic proxying to Shared Auth, allowing Track to be the source of truth again for those auth mechanisms. This fallback made sense because Track was (up to that point) the only Toggl product allowing Google/Apple access, so it had enough data in its database to resolve those requests.
Before the rollout we had also put in place a process in the new auth mechanism that would inform Track about any Google/Apple ID registered in the unified system. A couple of days later we fixed the problem and Track's old endpoint returned to just proxying the requests to the new system, as intended.
On launch day we also saw a race condition that affected a small portion of users signing up through invitations. For these users, we should not have created a personal Track workspace, because they were already joining an existing one. Because of how we were processing PubSub events in Track, we could end up creating a local record with the personal Track workspace before checking whether the user was signing up due to an invitation. So, for some days, some users got a personal workspace, which created mild confusion, until we fixed the race condition and cleaned up the workspaces created due to this problem.
Nine months after the launch of Shared Auth we cleaned up the deprecated endpoints as planned.
Flash Forward & Conclusions
Almost one year later, we successfully integrated Work, our new expense management and reporting platform, which relied on Shared Auth from the get-go, as well as Hire, which recently migrated all its users to the unified system.
Following the successful implementation of Shared Auth, we continue to work on integrating our products. This project marks a significant step towards interoperability and sets a strong foundation for future enhancements.
This was only possible thanks to the effort put forward by the involved stakeholders, who represented all teams within the company. Thank you for your wide-ranging contributions to delivering this project.