Immersive Labs has experienced rapid growth in recent years, and we work with some of the largest enterprise and government organizations in the world. As an engineering leader, one of the most interesting aspects of this growth has been managing the tension between the drive to deliver new and innovative products and features and the need to focus on improving scalability and performance.
At the start of 2022, the balance had swung more towards product innovation, which saw us launch two exciting new products, along with delivering a major overhaul of the user experience and information architecture of the site. As we entered the second half of the year, we wanted to swing the balance back towards scalability and performance, and decided the best way to achieve this was to embrace our One Team value by making performance improvement a team sport! 🤗
Magic happens when you bring people together with a clear focus and a shared goal. That was our aspiration when we organized our first Performance Week: a week where teams came together and focused solely on improving system performance.
Our product teams, platform and infrastructure teams, data engineering team, and key engineers from our cyber content team took part in Performance Week. To maximize engagement, we defined a broad scope of performance improvement areas for teams to explore:
- Frontend/UI performance: asset caching, code splitting, lazy loading, etc.
- Backend/API performance: code performance, query caching, orchestration, etc.
- Database performance: query and index optimization, row scanning, sharding, etc.
- Reporting/insights performance: data latency, caching, query optimization, etc.
- Dependencies: upgrading and optimizing third-party dependencies
- Lab performance: Docker image optimization, EC2 instance types and generations, disk performance, etc.
- User experience (UX): synchronous and asynchronous user flows, expectation management, etc.
Given the relatively short time frame, we wanted to help teams hit the ground running and give them the best possible chance of making tangible progress. So, we took the time to prepare the following in advance:
- An initial backlog of known performance issues
- A defined set of user-focused service level indicators (SLIs) for each product
- A documented set of quality attributes (non-functional requirements) used to assess system performance under specific load conditions
- A curated list of supporting resources for optimizing the performance of our core technologies, such as AWS services, Elasticsearch, Redis, Sidekiq, MySQL, and more
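As an illustration of what a user-focused SLI can look like in practice, here's a hedged sketch: a latency SLI defined as the fraction of requests served under a threshold. The threshold and sample data are hypothetical, not our actual targets.

```python
def latency_sli(latencies_ms, threshold_ms=500.0):
    """Return the fraction of requests faster than threshold_ms.

    This is a common shape for a user-focused latency SLI:
    "proportion of requests served in under 500 ms".
    """
    if not latencies_ms:
        return 1.0  # no traffic: treat the target as fully met
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)

# Hypothetical sample of request latencies in milliseconds
sample = [120, 340, 480, 510, 950, 200, 610, 90, 450, 300]
print(f"SLI: {latency_sli(sample):.0%} of requests under 500 ms")
```

Framing SLIs as a good-events-over-total-events ratio like this keeps them directly comparable across products, which is what made benchmarking at the start and end of the week possible.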
Although the primary goal of Performance Week was to meaningfully improve the overall system performance, we defined a number of additional goals that we hoped would embed a performance culture and mindset going forward:
- Performance: to identify and remediate the most impactful system performance issues
- Observability: to create dashboards providing insights into the health and performance of each product and domain
- Metrics: to benchmark SLIs at the start and end of the week, leading to the creation of service level objectives (SLOs) with associated alerts (monitors)
- Backlog: to create a backlog of longer-term performance improvements, including larger architectural changes and major refactors
- Upskilling: to collectively raise the skills and shared understanding of observability across all teams
- Best practices: to agree on and document a set of best practices and patterns for building highly scalable, performant systems
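The metrics goal above — turning SLIs into SLOs with monitors — can be sketched as a simple check: compare a measured SLI against an SLO target and work out how much error budget remains. The numbers below are illustrative only, not our actual targets.

```python
def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    The error budget is (1 - slo_target); what we've burned is (1 - sli).
    1.0 means an untouched budget, 0.0 exactly exhausted, and a negative
    value means the SLO has been breached.
    """
    budget = 1.0 - slo_target
    burned = 1.0 - sli
    return (budget - burned) / budget

def should_alert(sli: float, slo_target: float, alert_at: float = 0.25) -> bool:
    """Fire a monitor once less than alert_at of the budget remains."""
    return error_budget_remaining(sli, slo_target) < alert_at

# Illustrative: a 99.5% availability SLO with a measured 99.8% SLI
print(should_alert(sli=0.998, slo_target=0.995))  # plenty of budget left
```

In practice a platform like Datadog evaluates this kind of condition continuously over a rolling window; the sketch just shows the arithmetic behind the monitor.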
To achieve these ambitious goals and ensure every team was set up for success, we gave Performance Week a distinct structure.
The day before Performance Week started, we ran a couple of masterclass sessions. These were led by two of our principal engineers who are experts in observability.
The first masterclass provided an overview of our observability platform, Datadog, and delved into its incredible range of features for analyzing and diagnosing performance issues. The session also walked through a sample Datadog dashboard that teams could use as a template when creating their own, helping to improve the quality and consistency of our team and domain-level dashboards. This session was applicable to all engineers.
The second session provided a deeper dive into frontend observability, exploring Datadog’s real user monitoring (RUM) capabilities, along with the performance features of Chrome’s Developer Tools. This session was aimed more towards frontend engineers, but was also relevant for quality engineers and UX designers.
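Without reproducing Datadog's RUM feature set here, the kind of aggregation a RUM tool performs can be sketched: computing the 75th percentile of a Core Web Vital such as Largest Contentful Paint across real user sessions. The sample values are hypothetical; the 2,500 ms threshold is Google's published "good" boundary for LCP at p75.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty sample."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Hypothetical LCP samples (ms) collected from real user sessions
lcp_samples = [1200, 1800, 2400, 2600, 900, 3100, 1500, 2200]

p75 = percentile(lcp_samples, 75)
# Web Vitals guidance treats LCP <= 2500 ms at p75 as "good"
print(f"p75 LCP: {p75} ms ({'good' if p75 <= 2500 else 'needs improvement'})")
```

Using a percentile rather than an average matters here: a handful of very slow sessions can hide behind a healthy-looking mean, which is exactly what RUM dashboards are designed to surface.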
There was high engagement across both sessions, and it was a great way to bring people together to develop the requisite skills and increase confidence with observability. It also got people excited about Performance Week and highlighted the potential for each individual to get involved and contribute to our success!
Across the product organization, we have subject matter experts in key areas of performance and observability, and these skills were made accessible to all teams throughout the week. We created a dedicated Slack channel where anyone could reach out for support, with subject matter experts (SMEs) also dropping in to see how teams were performing and offering guidance as needed. Our test automation team was also on hand to support teams with specific load testing requirements, helping to quickly validate performance improvements.
Performance Week review
We ended the week in style with a session for everyone who’d taken part. It followed a similar format to a Sprint Review meeting – each team presented outcomes from the week, demonstrated performance improvements they’d been working on, and covered plans for future performance improvements. This meeting turned out to be one of the highlights of the year, and I was blown away by the ingenuity teams had shown in solving some complex technical challenges, and by some of the tangible performance improvements achieved! 🚀
Performance Week fully exceeded my expectations, and has been the catalyst for embedding a performance culture and mindset across the entire product organization.
Through the range of performance improvements that were delivered, both during and since Performance Week, we’ve:
- Reduced overall page load times by over a second, significantly reduced the load on our databases, and have a well-defined backlog of further performance improvements that will shape our priorities for 2023
- Improved observability of system performance across the platform – automated alerts have been created to notify teams when performance issues arise, and teams are able to quickly analyze performance issues to establish the root cause
- Empowered teams with a stronger sense of technical ownership and greater confidence in managing their products and domains in the production environment
- Critically, made performance front-of-mind for all teams – from the inception of a new product idea, through to deploying and monitoring the changes in production
The true test of Performance Week has been observing how our platform has handled a significant increase in usage through 2022. It’s been gratifying to see stable system performance as the platform has scaled to meet demand, and how quickly teams have been able to identify and remediate performance issues as they’ve arisen.
But the most important outcome of this has been our ability to continue providing an exceptional experience to our customers all around the world! 🙌
Improving the performance of cloud-based systems requires a holistic strategy that considers all the different factors that can impact performance. Our approach of bringing a cross-functional team together for a short period of time with a shared goal and a singular focus on performance proved highly effective. I would recommend this approach to any organization with concerns about the performance of their system, or where there’s a desire to instill a performance culture and mindset more deeply across their teams. When you bring people together and foster the right conditions, what can be achieved is truly magical. ⭐