We believe in being brave and intentional in everything we do, including how we communicate during challenges.
When incidents happen, it’s an opportunity to demonstrate our drive for results, share what we’ve learned, and improve together. This blog reflects that commitment by providing an honest account of what occurred, how we addressed it, and the steps we’ve taken to ensure a better experience for our customers moving forward.
On 20th December 2024 at 10:45 UTC, multiple internal and external sources started reporting intermittent timeouts and access issues affecting Tillo’s APIs and user portal.
Once the incident was confirmed, we triggered our incident response procedure at 11:03 UTC and formed a cross-functional incident management team.
We quickly established that the majority of these intermittent timeouts were caused by dropped DNS packets, and that the most affected services were those communicating with AWS ElastiCache.
Since our entire workload runs in the AWS cloud, we immediately opened a support ticket with DoiT, our AWS support partner, who escalated it to AWS Enterprise Support. AWS support helped us identify the root cause of these intermittent network timeouts.
Although the network disruption lasted only 97 minutes and normal service resumed at 12:22 UTC, it took the AWS support teams until 16:47 UTC to determine the root cause of the incident. Resolving stuck Storefront orders and other data inconsistencies caused by the incident took until 22:02 UTC.
All our services operate in AWS Elastic Kubernetes Service (EKS), where we run a single static managed node group of three t3.medium nodes alongside a Karpenter-managed pool of dynamically allocated spot EC2 instances. The managed node group exists primarily to run Karpenter and CoreDNS, and so uses a relatively small instance type, as we want most of our cluster capacity to run on spot instances.
EKS and Karpenter interacted in a way that forced many of our pods onto the managed node group, importantly including at least one of our CoreDNS pods. Because of the way Karpenter bin-packs workloads, this left the managed nodes overloaded with Tillo workloads.
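For readers running a similar EKS and Karpenter setup, this condition is easy to spot once you know to look for it. The sketch below is an illustration rather than our own tooling; the label and namespace names are assumptions about a typical cluster. It lists every pod scheduled onto the managed node group and flags anything that isn’t a system add-on.

```python
from kubernetes import client, config

# Assumed label/namespaces for a typical EKS cluster, not our exact configuration.
MANAGED_GROUP_LABEL = "eks.amazonaws.com/nodegroup"   # present on managed node group nodes
SYSTEM_NAMESPACES = {"kube-system", "karpenter"}      # namespaces we treat as add-ons

def pods_on_managed_nodes() -> None:
    """Print every non-system pod scheduled onto the managed node group."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    managed = {n.metadata.name for n in v1.list_node(label_selector=MANAGED_GROUP_LABEL).items}
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.spec.node_name in managed and pod.metadata.namespace not in SYSTEM_NAMESPACES:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} is running on {pod.spec.node_name}")

if __name__ == "__main__":
    pods_on_managed_nodes()
```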
Because Tillo workloads are network-heavy applications, we exhausted the connection-tracking allowance on the Elastic Network Interfaces (ENIs) attached to the managed nodes, which is reported through the conntrack_allowance_exceeded counter. Once that allowance was exceeded, the ENIs started dropping packets. Because CoreDNS was running behind those ENIs, this presented as a DNS issue, and because CoreDNS was down, the failure cascaded across the cluster rather than being confined to the affected nodes.
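The ENA driver exposes these allowance counters on each node via ethtool, so a non-zero conntrack_allowance_exceeded is the clearest signal that an interface is dropping packets for this reason. A rough sketch of that check, run on the node itself (the interface name is an assumption):

```python
import subprocess

def conntrack_drops(interface: str = "eth0") -> int:
    """Return the conntrack_allowance_exceeded counter for an ENA interface."""
    stats = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    for line in stats.splitlines():
        name, _, value = line.strip().partition(":")
        if name.strip() == "conntrack_allowance_exceeded":
            return int(value.strip())
    return 0  # counter not present (older ENA driver)

if __name__ == "__main__":
    print(f"conntrack_allowance_exceeded: {conntrack_drops()}")
```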
The high load on the system made our autoscaling highly volatile. Workloads were moved around frequently, putting pressure on the cluster that isn’t normally present. The frequent movement is what made the issue intermittent, and the extra pressure is one of the reasons it happened in the first place.
Once the root cause of the incident had been established, our Platform team implemented several EKS and Karpenter configuration changes to prevent this incident from recurring.
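We won’t list every change here, but to give a flavour of the kind of adjustment involved, the sketch below shows one common mitigation for this failure mode rather than our exact configuration: tainting the managed node group so that only add-ons which tolerate the CriticalAddonsOnly taint, such as Karpenter and CoreDNS (both of which typically tolerate it), can schedule there, keeping general workloads on the spot capacity.

```python
from kubernetes import client, config

MANAGED_GROUP_LABEL = "eks.amazonaws.com/nodegroup"  # assumed EKS managed node group label
TAINT = {"key": "CriticalAddonsOnly", "value": "true", "effect": "NoSchedule"}

def taint_managed_nodes() -> None:
    """Add the CriticalAddonsOnly taint to every managed node group node."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for node in v1.list_node(label_selector=MANAGED_GROUP_LABEL).items:
        existing = node.spec.taints or []
        if any(t.key == TAINT["key"] for t in existing):
            continue  # already tainted
        # Keep any existing taints and append ours; a taint's value may be absent.
        taints = [
            {"key": t.key, "effect": t.effect, **({"value": t.value} if t.value else {})}
            for t in existing
        ] + [TAINT]
        v1.patch_node(node.metadata.name, {"spec": {"taints": taints}})
        print(f"tainted {node.metadata.name}")

if __name__ == "__main__":
    taint_managed_nodes()
```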
Storefront, our relatively new B2C product, includes built-in safeguards to handle the complex gift card fulfilment process. These safeguards allowed us to retry fulfilment for affected orders, ensuring no orders were lost.
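We won’t share Storefront’s internals here, but as a simplified illustration of the general pattern behind safeguards like these (the names below are hypothetical, not our actual code), a retry loop with an idempotency key and exponential backoff might look like this:

```python
import random
import time

class FulfilmentError(Exception):
    """Hypothetical transient failure from the gift card fulfilment step."""

def fulfil_order(order_id: str, idempotency_key: str) -> None:
    """Stand-in for the real fulfilment call; fails transiently to demo retries."""
    if random.random() < 0.5:
        raise FulfilmentError(f"temporary failure fulfilling {order_id}")

def fulfil_with_retries(order_id: str, attempts: int = 5, base_delay: float = 2.0) -> bool:
    # The same idempotency key is sent on every attempt so a retried order is
    # never fulfilled twice, even if an earlier attempt actually succeeded.
    idempotency_key = f"fulfilment-{order_id}"
    for attempt in range(attempts):
        try:
            fulfil_order(order_id, idempotency_key)
            return True
        except FulfilmentError:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False  # leave the order flagged for manual replay

if __name__ == "__main__":
    print("fulfilled" if fulfil_with_retries("order-123") else "needs manual replay")
```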
We thank our customers for their patience and understanding while we dealt with this incident, and we are deeply sorry for any disruption and inconvenience. We are committed to learning from this experience and to delivering improvements to our service and our communication. We have already implemented most of the changes outlined above and will remain diligent in ensuring this cannot happen again.