What happened on August 11 and what we've learned

August 19, 2020 · written by SimpleLogin team

On August 11, 2020, some SimpleLogin users experienced email delays of up to 8 hours. We deeply apologize for this incident and have taken measures to prevent it from happening again. No emails were lost during this time.

Here’s the timeline of what happened and the measures we’ve taken to better handle these situations.

First wave

In the morning, we noticed that the second server, mx2.simplelogin.co, had a spike in emails. This server is the failover for the primary one (mx1.simplelogin.co) and usually handles only a fraction of emails. We also received emails from some SimpleLogin users asking about the email delay.

When we checked mx1, the email handler container was down even though it is set up to restart automatically. We also noticed that SpamAssassin, a program used for detecting spam, was using 100% of the CPU.

We decided to scale up the main server. After redirecting most of the traffic to the second server, we quadrupled the capacity of the first server. Everything seemed to be back to normal and pending emails were quickly sent.

Second wave

In the afternoon, we noticed that the email queue was again abnormally high. It turned out that all requests to SpamAssassin were timing out, which delayed Postfix email delivery. We had to fall back on the emergency solution of disabling spam checking. Email delivery returned to normal, but we knew this was only a temporary solution.
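
To illustrate the failure mode described above, here is a minimal Python sketch of one way to keep a slow spam scanner from holding up delivery: run the check under a time budget and treat a timeout as "no verdict" rather than waiting. This is not SimpleLogin's actual code. The names `spam_verdict_with_timeout`, `fast_scan` and `stuck_scan` are hypothetical, and the scanners are stand-ins for a real call to SpamAssassin.

```python
import concurrent.futures
import time

# Shared worker pool (size chosen arbitrarily for this sketch).
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def spam_verdict_with_timeout(scan, message, timeout_seconds=5.0):
    """Return scan(message), or None if the scan fails or exceeds the time budget.

    `scan` is any callable that takes the message and returns True for spam.
    A None result means "no verdict": the caller can deliver the email
    unscanned instead of leaving it in the queue.
    """
    future = _executor.submit(scan, message)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The scan is too slow. Stop waiting for it; note that a scan which
        # has already started keeps running in its worker thread.
        future.cancel()
        return None
    except Exception:
        # The scanner itself failed; also treated as "no verdict".
        return None


# Hypothetical stand-ins for a real scanner call.
def fast_scan(message):
    return "free money" in message.lower()


def stuck_scan(message):
    time.sleep(0.5)  # simulates a scanner that does not answer in time
    return True
```

With these stand-ins, `spam_verdict_with_timeout(stuck_scan, "hello", timeout_seconds=0.1)` gives up after roughly 0.1 seconds and returns `None`, while `fast_scan` returns its verdict immediately. The trade-off is that some emails go through unscanned while the scanner is unhealthy, and hung scans still occupy worker threads, so a real setup would likely also put a timeout on the connection to the scanner itself.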

Actions

We took the following actions to prevent similar issues from happening in the future:

Lessons

From this incident, we learned to be extra careful when working with software that we don’t have much control over. SpamAssassin seems to be the root cause of this incident, but the same could happen with any other software in the stack.

Things we learned when investigating the issue: