Email (Mercury mail platform)

resolved

Description

Some clients on the Mercury mail platform have been reporting slowdowns and timeouts when logging in. Engineers have traced this to degraded performance of the mail storage array and are investigating the cause. Further updates will follow once their investigation is more complete.


Updates

3rd September 2018, at 11:52AM

[resolved] As we have had no further reports of issues for 2 weeks, this problem is now marked as resolved.

21st August 2018, at 11:15AM

[monitoring] We will continue to monitor the Mercury email service closely and perform the previously mentioned upgrades soon.

18th August 2018, at 6:39PM

[unresolved] Logging and analysis of the storage system proceeded through most of Friday, resulting in a valuable data set for our storage vendor to work from.

With their help we were able to apply configuration changes on Friday that have improved some areas of performance. Through late Friday and into Saturday we have seen vastly improved performance. At this time we expect you should be able to collect your email with few or no errors.

Moving forward, we have also identified several areas of improvement to address with the vendor, including a firmware bug at the motherboard level that we can address with direct maintenance work on Sunday (Aug 19). We hope that this, coupled with the improvements already made, will keep system performance at acceptable levels.

That said, later in the week of the 20th we hope to deploy additional RAM and CPU capacity to further reduce the risk of performance failure.

17th August 2018, at 4:32PM

[unresolved] The status of this issue remains unchanged at this time. We are awaiting a review of the system load by the hardware vendors in order to assess how to proceed.

Please note that email is being received and is flowing. As per earlier advisories, if you leave your IMAP or POP client open, it will connect and email will be downloaded; you may just see a slight delay here and there.

17th August 2018, at 11:39AM

[unresolved] We are aware of high loads affecting our mail services. Symptoms range from slow webmail or mail loading times to emails appearing blank.

Our engineers are actively monitoring and investigating the latest load increase and working to resolve this as soon as possible.

Further updates to follow once their investigation is more complete.

16th August 2018, at 8:00AM

[monitoring] Work by our engineers and the vendors' engineers continued overnight. We are now monitoring; loads seem to be back to normal and email is being delivered. We will continue to monitor throughout the day to see if there is a slowdown during peak hours this morning.

15th August 2018, at 7:53PM

[unresolved] From 4-5 p.m. onwards we have seen a marked improvement, and as of early evening things are running quite fast and back to optimal levels.

That said, the issue is not formally rectified, and further delays at times of peak load remain possible. We continue to work actively with the hardware vendor to assess the cause of the performance degradation during peak load.

15th August 2018, at 4:50PM

[unresolved] Work continues apace with our hardware vendors. As the fault has persisted longer than we'd like, we wanted to give a little more detail on the issue, its causes, and the work at hand. At a very high level, the issues being experienced are caused by a slowdown of the disk array we use to power the Mercury email platform.

The Mercury storage system uses a very complex and sophisticated disk array, provisioned to avoid this very scenario. The system uses two physical collections of disks, separated into different machines. Each machine houses dozens of physical hard disks. Client email data is then stored across both machines and across multiple disks in each. The idea is that a failure anywhere in the setup should not matter. Also, just for good measure, the whole storage is duplicated a third time as a backup.

In total we currently manage more than 50TB of primary email storage (not counting the triplicated backups). That huge array of disks is in turn managed by two physically separated head controllers overseeing all of that data storage; again, should one fail, the other is a live standby. This is just the storage element of Mercury. Dozens of additional servers work to authenticate, serve, filter and deliver email.

This fault, though, affects just the storage system. This morning we started to see degraded performance on the storage array. To the client, this manifests as a seeming inability to collect email. As an aside, if you keep an IMAP client open on your machine, you will receive your email; the system isn't down, it is just very slow. Our team have been testing this internally through the day on personal email hosted on Mercury, and are able to collect email, albeit slowly.

Once we identified this issue we started to review the case with our hardware supplier. Naturally, it's quite an expensive and complex agglomeration of hardware, and we have a very solid 24/7 support contract to ensure we have expertise on hand in critical times like this.

We began a two-pronged approach to the problem with their assistance. The initial working theory was that performance might be degrading from a lack of free space overall. For reference, we keep the array at 30-40% free space at all times, but it was possible that might not be enough (though these are the recommended levels for this piece of hardware). At any rate, we already have spare physical disks in the array for this eventuality, and the vendor assisted us in making these available for use. These were successfully enabled in the last few minutes.

In case the issue is unrelated to overall free space, the vendor is also separately analyzing the performance of the array as a separate matter. Both of these items are proving slow to progress, as the overall performance slowdown of the array is complicating matters.

With the storage pool expanded, we're monitoring the system and hoping for better performance. A compounding issue is that the system now has exceptional levels of load, e.g. from frustrated users trying to collect email. This in turn puts unusually high load on the array. We're currently working to mitigate that.

15th August 2018, at 2:12PM

[unresolved] Access has improved but not to the standard we had hoped. Our storage software vendor is now aiding in our analysis.

15th August 2018, at 12:11PM

[unresolved] Engineers have a working theory that hardware configuration is at fault and are currently attempting to swap physical hard drives in order to rule this out.