Incident Analysis: Lessons Learned from December’s Incident

Insights
Posted on 13 January 2025
Read time 2 mins
Author David (CTO) ⚙️

We believe in being brave and intentional in everything we do, including how we communicate during challenges.

When incidents happen, they are an opportunity to demonstrate our drive for results, share what we've learned, and improve together. This post reflects that commitment by providing an honest account of what occurred, how we addressed it, and the steps we've taken to ensure a better experience for our customers going forward.

On 20th December 2024 at 10:45 UTC, multiple internal and external sources started reporting intermittent timeouts and access issues affecting Tillo's APIs and user portal.

Once the incident was confirmed at 11:03 UTC, we triggered our incident response procedure and formed a cross-functional incident management team.

We quickly established that the majority of the intermittent timeouts were caused by dropped DNS packets, and that the most affected services were those that communicate with AWS ElastiCache.

Since our entire workload runs in the AWS cloud, we immediately opened a support ticket with DoiT, our AWS support partner, who escalated it to AWS Enterprise Support. AWS Support helped us identify the root cause of the intermittent network timeouts.

Although the network disruption lasted only 97 minutes and normal service resumed at 12:22 UTC, it took the AWS support teams until 16:47 UTC to determine the root cause of the incident. Resolving stuck Storefront orders and other data inconsistencies caused by the incident took until 22:02 UTC.

Why did it happen?

All our services run in AWS Elastic Kubernetes Service (EKS), where we operate a single static managed node group of three t3.medium nodes alongside a Karpenter-managed pool of dynamically allocated spot EC2 instances. The managed node group exists primarily to run Karpenter and CoreDNS, and so uses a relatively small instance type, as we want most of our cluster capacity to run on spot instances.
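
For context, a Karpenter node pool restricted to spot capacity looks roughly like the sketch below. This is a minimal illustration using the Karpenter v1 NodePool API and the kubernetes Python client; the resource names and the EC2NodeClass reference are assumptions, not our actual configuration.

```python
# Minimal sketch (not our actual config): register a Karpenter NodePool that
# only provisions spot EC2 instances, using the kubernetes Python client.
# Assumes the Karpenter v1 CRDs are installed and an EC2NodeClass named
# "default" already exists.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

spot_node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "default"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "requirements": [
                    # Only provision spot capacity for general workloads.
                    {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["spot"]},
                ],
            }
        }
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=spot_node_pool
)
```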

EKS and Karpenter interacted in a way that forced many of our pods onto the managed node group, importantly including at least one of our CoreDNS pods. Because of the way Karpenter bin-packs workloads, the managed nodes became overloaded with Tillo workloads.

Because Tillo workloads are network-heavy applications, we exceeded the connection-tracking (conntrack) allowance of the Elastic Network Interfaces (ENIs) attached to the managed nodes, which is surfaced by the ENA driver's conntrack_allowance_exceeded counter. Once the allowance was exceeded, the ENIs started dropping packets. Because CoreDNS was running behind those ENIs, this presented as a DNS issue, and because CoreDNS was down, it caused cascading failures across the cluster, not just on the affected nodes.
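
A quick way to confirm this on a node is to read the ENA driver's network-allowance counters, which are exposed through ethtool. The sketch below is illustrative (the interface name eth0 is an assumption); a growing conntrack_allowance_exceeded value means the ENI is dropping packets because the connection-tracking allowance has been exhausted.

```python
# Minimal sketch: read the ENA network-allowance counters on a node via ethtool.
# Assumes the primary interface is eth0 and the ENA driver is in use.
import subprocess

def ena_allowance_counters(interface: str = "eth0") -> dict[str, int]:
    """Parse `ethtool -S <interface>` and return the *_allowance_exceeded counters."""
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in output.splitlines():
        name, _, value = line.strip().partition(":")
        if name.endswith("allowance_exceeded"):
            counters[name] = int(value.strip())
    return counters

if __name__ == "__main__":
    # e.g. conntrack_allowance_exceeded, pps_allowance_exceeded, bw_in_allowance_exceeded
    for name, value in ena_allowance_counters().items():
        print(f"{name}: {value}")
```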

The high load on the system made our autoscaling highly volatile. As a result, workloads were moved around frequently and the cluster was under pressure that isn't normally present. The frequent movement is what made the issue intermittent, and the high pressure is one of the reasons it happened in the first place.

Response and recovery

Once the root cause of the incident had been established, our Platform team implemented several EKS and Karpenter configuration changes to prevent it from recurring:

  • Nodes in the managed node group were upgraded to c6a.large to increase network performance limits.
  • Karpenter has been tuned to only provision instance sizes large enough that we shouldn't hit the network performance limits, since larger instances have higher limits.
  • The Tillo Platform team has tainted the managed node group so that only approved workloads (CoreDNS, Datadog, and Karpenter) can be scheduled there; all other workloads will now only be scheduled on Karpenter-provisioned nodes (see the sketch after this list).
  • The number of Apache workers in our PHP applications was increased. This consolidated the application workload into fewer pods, reducing the number of health checks and metric uploads that were counting toward the network limits.
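
For illustration, tainting an EKS managed node group can be done through the EKS UpdateNodegroupConfig API. The sketch below uses boto3 with assumed cluster, node group, region, and taint key names rather than our real ones, and shows the matching toleration that an approved workload such as CoreDNS would then carry.

```python
# Minimal sketch (assumed names): taint the EKS managed node group so that only
# workloads carrying a matching toleration can be scheduled onto it.
import boto3

eks = boto3.client("eks", region_name="eu-west-2")  # region is an assumption

eks.update_nodegroup_config(
    clusterName="production",            # assumed cluster name
    nodegroupName="static-node-group",   # assumed node group name
    taints={
        "addOrUpdateTaints": [
            {"key": "tillo.io/system-only", "value": "true", "effect": "NO_SCHEDULE"}
        ]
    },
)

# Approved workloads (CoreDNS, Datadog, Karpenter) then need a matching
# toleration in their pod spec, for example:
toleration = {
    "key": "tillo.io/system-only",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}
```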

Product Impact

Storefront, our relatively new B2C product, includes built-in safeguards to handle the complex gift card fulfilment process. These safeguards allowed us to retry fulfilment for the affected orders, ensuring no orders were lost.
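
Conceptually, such a safeguard is a persistent, idempotent retry: failed fulfilment attempts are recorded and replayed with backoff until they succeed. The sketch below is purely illustrative; fulfil_order, FulfilmentError, and the order identifiers are hypothetical and not Storefront's actual code.

```python
# Purely illustrative sketch of a retry-with-backoff safeguard; fulfil_order,
# FulfilmentError, and the order identifiers are hypothetical.
import time

class FulfilmentError(Exception):
    """Raised when a fulfilment attempt fails (e.g. a network timeout)."""

def fulfil_order(order_id: str) -> None:
    """Placeholder for the call that fulfils a gift card order."""
    raise NotImplementedError

def fulfil_with_retries(order_id: str, attempts: int = 5, base_delay: float = 2.0) -> bool:
    """Retry fulfilment with exponential backoff; return True once it succeeds."""
    for attempt in range(attempts):
        try:
            fulfil_order(order_id)
            return True
        except FulfilmentError:
            # Back off before retrying so a transient outage has time to clear.
            time.sleep(base_delay * (2 ** attempt))
    return False  # left for a later replay or manual resolution
```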

Conclusion

We thank our customers for their patience and understanding while we dealt with this incident, and we are deeply sorry for any disruption and inconvenience. We are committed to learning from this experience and to delivering improvements to our service and our communication. We have already implemented most of the changes outlined above and will remain diligent to ensure this cannot happen again.
