Ad Serving Is Not Operational
Incident Report for Nativo
Postmortem

Summary

  • On September 21, 2020, at 4:05PM PDT, we deployed a small patch to help resolve an issue with Grapeshot topic classification
  • As a result of the patch, newly added ad servers were not able to launch. At that time, there was no need for additional ad serving capacity and the system was operating normally.
  • Overnight, our system automatically scales down the number of available servers to account for the decrease in traffic
  • Around 4:00AM PDT the following day, as regular overall traffic increased, the system automatically attempted to add new ad servers to handle the increasing load. All attempts to add those servers failed due to the issue described above, leaving the ad serving system with reduced capacity
  • Around 4:40AM PDT the increased load caused the ad servers to start throttling requests, and our team was alerted and began investigating the issue
  • Around 5:00AM PDT the ad serving system was effectively down as it was no longer able to handle the increasing load
  • The root cause was identified around 6:00AM PDT and the patch was quickly reverted. By 7:00AM PDT the ad serving system was back at 100% availability and serving ads. By 8:00AM PDT all budget pacing was back to operating at a normal rate

Improvements

Nativo engineers identified several areas where we can improve our ability to prevent such incidents from happening in the first place, or to be alerted to them at an earlier stage:

  • Reduce the logical steps and dependencies in the ad server startup phase (done)
  • Better separate errors from fatal errors and adjust alerting thresholds around fatal errors (partially done and expected to be completed in the next few weeks)
  • Improve monitoring after patch releases (done)
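
As an illustration only, and not Nativo's actual alerting code, the idea of separating fatal errors from ordinary errors can be sketched as below. The names `AlertThresholds` and `evaluate`, as well as the threshold values, are hypothetical and chosen purely for the example: ordinary errors are judged by rate, while fatal errors (such as a server failing to launch) are judged by count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    # Hypothetical values, chosen only for illustration.
    max_error_rate: float = 0.05  # tolerated share of non-fatal errors
    max_fatal_count: int = 0      # more fatal errors than this pages on-call


def evaluate(requests: int, errors: int, fatal_errors: int,
             thresholds: AlertThresholds = AlertThresholds()) -> str:
    """Return "page", "warn", or "ok" for one monitoring window."""
    # Fatal errors are checked first and by absolute count, so they
    # trigger an alert even when no requests were served at all.
    if fatal_errors > thresholds.max_fatal_count:
        return "page"
    # Non-fatal errors are judged relative to traffic volume.
    if requests > 0 and errors / requests > thresholds.max_error_rate:
        return "warn"
    return "ok"
```

Under these assumed thresholds, a server that fails during startup and serves no requests would still page immediately, because the fatal-error check does not depend on traffic volume.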

In addition, the team has identified several areas of improvement that can help in managing such incidents and bringing the system back online. We expect those improvements to be in place in the next few weeks.

We apologize for the inconvenience and the disruption in service, and we are committed to continuing to improve our product to ensure the highest level of service and support for all of you.

Thank you,
Nativo Engineering Team

Posted Sep 24, 2020 - 10:21 PDT

Resolved
Ad serving was degraded starting at 4:30AM and was fully down starting at 5:05AM. The issue was fully resolved around 7:05AM Pacific time.
Posted Sep 22, 2020 - 04:30 PDT