We met on the sidelines of Sepang International Circuit during the Malaysia Formula One Grand Prix this weekend. Sitting outside of Toro Rosso’s garage, in extreme heat and stifling humidity, Daniel shared with me what it takes to maintain the network for the most data-intensive and technologically advanced sport on earth.
Daniel, thank you for taking a few minutes out of your busy schedule. We can see the F1 cars flying past and you’re obviously confident that Toro Rosso’s network is doing well. Could you give us some insight into what it takes to set up the IT infrastructure for a Formula One race?
It takes a lot! The main Formula One race is on Sunday, but we’ve already been here for a week. When you arrive at the race circuit, it takes two days to set up the team infrastructure — offices, the pit wall, the garage — and interconnect it all with a LAN [local-area network]. Like all the other teams, we bring our own self-contained portable datacenter to power the network. It needs to be unpacked, connected, and tested. It may have been a week since the last time it was turned on, so we need to make sure everything performs perfectly, the way we want it. The portable datacenter always stays inside a transportation unit. It’s easy to put it on a plane and take it with us everywhere we go. It houses all of the server equipment we need during the race weekend.
What happens if the portable datacenter doesn’t make it to the race?
It’s never happened before, but if it doesn’t make it to the track, we have some additional hardware on standby at our factory in Faenza. Theoretically, we could ship it out here if we needed to.
And do you ship the race portable datacenter back to Faenza after each race?
Yes. If there is a sufficient gap between the races, we send it back to our factory in Faenza for maintenance. It also stays there at the end of the season for maintenance and upgrades.
Do you ever experience any faults in your datacenter, and if so, how do you deal with them?
We’ve had a few, but nothing major. Normally it’s something to do with the UPS [uninterruptible power supply] units. They give us a lot of grief because they really struggle in high temperatures and high humidity, just like we have here in Malaysia. Another problem that we have, which is unique to F1, is carbon fiber dust. For some reason, it really affects electronic equipment. Hot weather, humidity, and carbon fiber dust are not a good combination for server equipment.
So what happens when a UPS fails?
We’ve had a few UPS failures in the past, but they’re more of a nuisance than a real problem. Naturally, we don’t rely on just one UPS. We have multiple hot and cold spares. If a unit fails during the race, automatic power failover ensures that everything stays up.
I guess the UPS is important not only for power failover but also for power conditioning?
Yes. At some F1 circuits, the supply voltage can fluctuate. We need clean 50 Hz power for our network.
What kinds of servers do you have in there?
We have eight physical servers, plus the usual networking equipment. Out of the eight, six are ours and the other two belong to Ferrari, our engine supplier.
Our servers run VMware and Windows, each serving different purposes. For VMware, we use HP servers connected via a storage area network to an HP storage array. We have two storage arrays for extra redundancy and some load balancing. These arrays use solid-state drives with no moving parts, not only for performance but also because SSDs are more resilient against vibration.
So one of these VMware hosts has a virtual machine with the Atlas Data Server?
We actually have a few Atlas Data Servers. All of them run as virtual machines. Because these Atlas Data Servers are critical to our race (they process all of our telemetry data), we enabled VMware Fault Tolerance on each machine. So even if one of these VMs fails when the car is on the track, the telemetry will keep coming through.
When does the telemetry start coming in?
As soon as the car starts, the Atlas Data Server starts logging data automatically. The engineers do several telemetry checks throughout the day, and they also start the car several times to make sure all sensors are working.
Are there any sensors that give you trouble?
You would think so! With over 200 physical sensors and thousands of data points, you would expect some sensors to misbehave! Fortunately, our system works really well.
So what do you do if you see something that doesn’t look right?
[Laughs] We just tap the little graph on the screen and walk away! But no, seriously, our tech engineers always know exactly what the issue is.
Speaking of sensors, we have a lot of virtual sensors as well. When the data comes off the car, it’s relayed into a virtual ECU [electronic control unit]. The virtual ECU takes available telemetry information from the car and makes appropriate calculations. As you can imagine, there are some places in the car where you can’t place a physical sensor. So we have to make predictions about what is actually going on. There are thousands of virtual sensors that process data like this.
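As a rough illustration of the idea, a virtual sensor can be thought of as a function that derives a value from channels that physical sensors already provide. The Python sketch below is hypothetical and is not Toro Rosso's or the Atlas software's code: the channel names, the sample values, and the choice of deriving longitudinal acceleration from successive wheel-speed readings are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One physical telemetry reading (hypothetical channel layout)."""
    time_s: float           # timestamp in seconds
    wheel_speed_mps: float  # measured wheel speed in metres per second


def virtual_longitudinal_accel(prev: Sample, curr: Sample) -> float:
    """Derive acceleration (m/s^2) from two wheel-speed samples.

    No accelerometer is read here: the value is computed purely
    from data that a physical sensor already reports.
    """
    dt = curr.time_s - prev.time_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.wheel_speed_mps - prev.wheel_speed_mps) / dt


if __name__ == "__main__":
    a = Sample(time_s=0.0, wheel_speed_mps=50.0)
    b = Sample(time_s=0.5, wheel_speed_mps=55.0)
    print(virtual_longitudinal_accel(a, b))  # 10.0
```

A real virtual ECU would combine many more channels and far more elaborate models; the point here is only that the output is calculated rather than measured.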
Now tell me about your Windows machines. What are they for?
Well, one of those is a backup server. With our partnership with Acronis, you’re going to hear a lot about it! It has 60 terabytes of plain hard disk drive storage, because all we need is capacity. We use RAID, a redundancy technology for traditional hard drives, to provide some failure protection.
Is this the only backup server that you have?
Yes, it’s the only one that travels with us.
Do you replicate the data offsite?
When we are back in the factory, we archive backups from the backup server in the portable datacenter to our backup servers in the factory. Unfortunately, because of the sheer volume of data that we generate, we are not able to push backups offsite from the track across the wide-area network.
That said, the telemetry data is recorded live back in the factory during each race. We have a 30 Mbps MPLS [wide-area network] link from the racetrack to Faenza.
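To put those numbers in perspective, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the full 60 terabytes of backup storage had to cross the 30 Mbps link at its nominal rate, with no protocol overhead and no competing telemetry traffic. Decimal units (1 TB = 10^12 bytes) and the function name are also assumptions.

```python
def transfer_days(size_terabytes: float, link_mbps: float) -> float:
    """Days needed to move size_terabytes over a link_mbps link at full rate."""
    bits = size_terabytes * 1e12 * 8    # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)  # link rate in megabits/s -> bits/s
    return seconds / 86_400             # seconds -> days


if __name__ == "__main__":
    print(f"{transfer_days(60, 30):.0f} days")  # 185 days
```

Even under these optimistic assumptions the transfer would take roughly half a year, which fits with archiving the backups only once the portable datacenter is back in Faenza.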
Who provides your bandwidth?
We have the same provider for all races, Riedel Communications. They deal with local telecom companies to provide connectivity with strong service-level commitments in each country. They are also physically present here, so when something goes wrong, we can go and beat them up! [Laughs]
The fact that they have people here at the racetrack is a really good thing. Think about it: if you lose your MPLS link, you lose your connectivity. You won’t be able to submit an email support request, either, because there will be no email!
What other services do you use?
Only the standard set. The FIA [Formula One’s governing body] provides GPS and weather information. And we also use F1 management services: timing information, TV feeds, and so on.
Now that you’ve told me about the setup, can you tell me what runs through your head when you get up in the morning during a race weekend?
Well, the very first thing that I do when I wake up is check my phone. Not for how many likes I got on my Facebook account, but to see if there are any service faults. I get a lot of status emails and SMS texts from different servers. Some alerts are informational, such as backup status; others can be more serious. So, I get up, check my emails, and then worry all the way to work! [Laughs]
Sunday, the day of the main race, is very nerve-racking! I’m always very nervous when I first arrive at the track.
So you walk in, slowly switch on the monitors and hope everything is alright?
No, the system is actually very reliable. When I arrive on site, I go through the daily checklist. Check everything is working. And then do it again! [Smiles]
What do you use for monitoring?
We mainly use Cacti, with many different plugins. I like it because you can customize it quite a lot. We’ve been playing with other monitoring software too, but haven’t found anything that we like yet.
Have you tried Acronis Monitoring?
No, not yet. But I like what I’ve seen so far. I like the idea of complete data protection, where data backup, file sync, and monitoring are provided by the same vendor. Acronis is #1 in data backup, and we’ll soon find out if it’s #1 in server monitoring too!
So you come to work and check the graphs to make sure everything is nice and green?
Not really. We always have red graphs, but it’s because some systems are not used or switched off. For example, our IT configuration in Europe is slightly different. There are some devices that we only use for European F1 races and nowhere else. In Europe we don’t fly everything in: we use trucks for shipping, so it’s a little different.
How do you think Acronis is going to make your job easier?
For an IT engineer like me, uptime and backup are everything. Acronis’ new products sound very exciting. Technology like Acronis Instant Restore, which allows you to restore very quickly, is exactly what we’re looking for. When things do go wrong, just knowing that your data is protected is the best feeling that a sysadmin can have.
Thank you so much for your time and good luck to your team!