Service status updates

srcf.ucam.org domains were temporarily nonexistent, 2023-11-10

The ucam.org domain (link may not work), under which ‘srcf.ucam.org’ exists, temporarily disappeared from the Domain Name System today. We regret that, during that time, emails to @srcf.ucam.org addresses will have bounced (with delivery failures reported to the sender), and websites weren’t accessible via the www.srcf.ucam.org redirect service that older accounts have.

webserver.srcf.net systemd services not launched, 2023-10-14

The SRCF web server sinkhole / webserver.srcf.net was rebooted during our scheduled vulnerable period on Saturday 14th October, but a handful of users’ systemd services were not launched when the server booted back up. This is likely due to the server startup timing out and systemd giving up launching the remaining user tasks.

If you were affected, attempts to control existing services with systemctl would have resulted in “Failed to connect to bus” errors.

Unfortunately, due to the small number of accounts affected, this wasn’t noticed until 9 days later, with the remaining tasks launched around 11:45pm on Monday 23rd October. All user services and service management should be back to normal now.

Ancillary services offline - 2023-10-18 16:05-

Due to a loss of power at the West Cambridge Data Centre (WCDC), some non-user-facing services, including backup storage and one of our monitoring systems, have gone down.

Mailman delivery delays, 2023-09-18 and 2023-09-19

The queue processor for Mailman, which runs user and group account mailing lists, quietly became stuck and stopped handling incoming emails. This meant emails were being accepted by our mail server but not being processed.

The logs suggest the problems started around 8am on Monday 18th, with messages backing up until 7pm on Tuesday 19th when the stuck runner was noticed and restarted.

The queued messages were all released at once and initially hit the sending limits of our upstream email relay, ppsw, so some messages have been deferred and may take a few hours to get through.

Poor website performance, Friday 2023-09-15

The SRCF webserver sinkhole saw a large number of incoming requests from various IP addresses and servers belonging to a particular cloud provider, likely as part of a denial-of-service attack. This caused performance to drop significantly as the machine became overloaded.

Alerts started at around 4am BST. Initial attempts to block the problematic IP ranges were made at 10:30am, but performance continued to vary until about 7pm as the blocking was adjusted.

Total service outage, 2023-02-05 01:58 to 11:20

The SRCF experienced a total outage of its main server cluster (“thunder”), which our monitoring systems noticed from 01:58 onwards tonight.

Real-time updates from the investigation follow:

  • 02:25 – corrected the year in the title (it’s 2023 now!). Signs point to this being a networking failure, either in our upstream network connection to the outside world or in an intermediate network switch that we rely on for this connection. A physical visit to the datacentre would be necessary to confirm this, which we can conduct in the morning.
  • 11:57 – we sent someone on site and discovered that a single electrical circuit breaker (technically an RCBO) had tripped. The intermediate switch carrying our network connection, mentioned at 02:25, had a single electrical feed on that circuit, which disrupted our network connection.
    We have moved this switch over to the alternate power feed, and services have been reachable again since 11:20.

We will continue to monitor the situation remotely and are liaising with building services to resolve any electrical issues. There are opportunities to improve redundancy of power feeds and network uplinks, to eliminate them as single points of failure, which we aim to pursue in due course.

Recovering from power outage, 2022-08-11

Due to a power outage on the West Cambridge site, some of our auxiliary services (those listed here) suffered sudden power loss. Other services, including core services like user files, shell access, email and websites, managed to keep going on battery- or generator-powered backup supplies.

We have since been able to restart downed services after the resumption of power to the site, and we will continue to monitor for any lingering issues.

UPDATE: Power outage - ALL services at risk, 2022-08-11

We’ve learned that the power outage reported earlier may affect the entirety of the West Cambridge site, which physically contains all of our servers and services. Our earlier at-risk warning now applies to the entire SRCF.

Power outage - some services at risk, 2022-08-11

We’ve been informed of a mains power outage at the host site for some of our servers. If battery backup power runs out before mains power is restored, then some of our auxiliary services will be unavailable:

  • Sysadmins’ primary mailbox store (so we may take longer to see mail you send us)
  • Realtime monitoring/probing host (so we will have less oversight of the status of the rest of our infrastructure)
  • One IRC network node (so the IRC network will have less resiliency, in particular with one remaining node inside the UDN)
  • Graphical VM management host (so we may have to resort to stone-age command-line methods to manage our fleet of servers…)

UK heatwave watch, Mon 18/Tue 19 Jul 2022

Temperatures over much of England are forecast to reach 40 °C in a few days’ time, on Monday 18 and Tuesday 19 July. We may need to SCRAM some or all of our servers if cooling systems fail as a result.
