How a Terraform + Salt + Kubernetes GitOps infrastructure enabled a zero downtime hosting provider switch

The switch

It has been a busy weekend: I switched the hosting provider of my whole cloud infrastructure from DigitalOcean to Hetzner.
If you are reading this, the switch is complete and you are being served from the Hetzner cloud.

The interesting part is that I completed the transition from one hosting provider to the other with zero downtime.

The Phoenix Server

One of the pillars behind this success story is the concept of the Phoenix Server: at any moment in time, I could recreate the whole infrastructure from scratch in an automated way.

How?

  • The infrastructure resources are declared in Terraform. Using the Terraform Hetzner provider, I could bring the whole infrastructure up with a single terraform apply.
  • The configuration is defined as Salt states, versioned in Git.

At some point, I made the big effort of translating every configuration, tweak, and customization I had made to each part of the infrastructure into a repository of Salt states, which I have kept up to date ever since.

Two notable examples: I am picky about fail2ban and ssh.

The result is that, after provisioning the infrastructure, I could configure every server exactly how I want it by simply applying the Salt highstate.

  • The application stack relies on containers: every application runs in its own container to be portable and scalable. The orchestration is delegated to Kubernetes.

Once all the steps above had been applied, I had an identical infrastructure running on Hetzner, while the old infrastructure was still working and serving users.
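The rebuild sequence described above (provision, configure, deploy) can be sketched as a small Python driver. This is only an illustration under stated assumptions, not the actual tooling behind this infrastructure: the directory names terraform and k8s, the Salt target '*', and the helper names rebuild_plan and run are hypothetical. The commands themselves (terraform apply, salt state.highstate, kubectl apply) are the standard CLI entry points of each tool.

```python
import subprocess
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    description: str
    command: List[str]
    cwd: Optional[str] = None  # directory to run the command in, if any


def rebuild_plan(terraform_dir: str = "terraform",
                 manifests_dir: str = "k8s") -> List[Step]:
    """Return the ordered steps to recreate the infrastructure.

    Both directory names are assumptions made for this sketch.
    """
    return [
        # 1. Create servers, networks, and volumes from the Terraform definitions.
        Step("provision resources",
             ["terraform", "apply", "-auto-approve"], cwd=terraform_dir),
        # 2. Apply the Salt highstate to every minion (target '*').
        Step("configure servers", ["salt", "*", "state.highstate"]),
        # 3. Deploy the containerized applications to Kubernetes.
        Step("deploy applications",
             ["kubectl", "apply", "-R", "-f", manifests_dir]),
    ]


def run(plan: List[Step], dry_run: bool = True) -> None:
    """Print each step and, unless dry_run is set, execute it in order."""
    for step in plan:
        print(f"[{step.description}] {' '.join(step.command)}")
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(step.command, cwd=step.cwd, check=True)


if __name__ == "__main__":
    run(rebuild_plan(), dry_run=True)
```

With dry_run=True the driver only prints the commands; a real run would also need terraform init on a fresh checkout and Salt minions that are already registered with the master.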

DNS switching

At this point, I had a mirror environment running in the Hetzner cloud, but it was not serving any clients yet.

Why?
Let’s consider an example to explain the next step.

This website, www.michelebologna.net, is one of the services run by the infrastructure.
Every user was still resolving www.michelebologna.net to the old address, so the old infrastructure was still serving it.

To test the new infrastructure, I edited my /etc/hosts and pointed www.michelebologna.net to the IP of the new reverse proxy (note: this is required to bypass the load balancers). Once I verified that the site worked, I was ready for the switch.
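The same check can be done without touching /etc/hosts: connect to the new reverse proxy by IP while presenting the real hostname for TLS and in the Host header. The following is a minimal sketch of that idea, not the procedure I actually used. The address 203.0.113.10 is a documentation placeholder, and the function names are hypothetical.

```python
import socket
import ssl


def build_request(hostname: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 HEAD request for the given virtual host."""
    return (
        f"HEAD {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def status_line_via_ip(hostname: str, ip: str, port: int = 443,
                       timeout: float = 10.0) -> str:
    """Connect to `ip` directly, but present `hostname` for TLS and HTTP.

    The TCP connection skips DNS entirely; the certificate is still
    validated against `hostname`, as a browser would do.
    """
    context = ssl.create_default_context()
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            tls.sendall(build_request(hostname))
            first_line = tls.makefile("rb").readline()
    return first_line.decode("iso-8859-1").strip()


# Example (placeholder IP, requires network access):
# print(status_line_via_ip("www.michelebologna.net", "203.0.113.10"))
```

A successful status line here means the new reverse proxy serves the site with a valid certificate, which is the same signal the /etc/hosts test gives.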

The switch happened at the DNS level: I simply changed the CNAME for the www record from the old reverse proxy to the new one. Thanks to the consistent naming scheme I have been using for servers, the switch was effortless.
After the switch, I quickly started tailing the logs of the reverse proxy: as soon as the upstream DNS servers picked up the new record, users started reaching the website via Hetzner. Success!
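Besides watching the logs, the propagation of the new record can be checked from the client side. Here is a minimal sketch that asks the local resolver which addresses a name currently maps to and compares them with the addresses of the new reverse proxy. The function names and the placeholder addresses are assumptions for illustration; the resolver is injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, Set


def resolved_addresses(hostname: str,
                       resolver: Callable = socket.getaddrinfo) -> Set[str]:
    """Return the set of IP addresses the resolver reports for hostname."""
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    return {info[4][0] for info in resolver(hostname, None)}


def has_switched(hostname: str, new_ips: Iterable[str],
                 resolver: Callable = socket.getaddrinfo) -> bool:
    """True when every address returned belongs to the new infrastructure."""
    addresses = resolved_addresses(hostname, resolver)
    return bool(addresses) and addresses <= set(new_ips)


# Example (placeholder IP, requires network access):
# print(has_switched("www.michelebologna.net", {"203.0.113.10"}))
```

This only reflects the view of the resolver used by the machine running the check: other resolvers may keep serving the old record until its TTL expires, which is why the old infrastructure has to stay up for a while.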

Trivia: after 5.5 years, the old reverse proxy was shut down. In its memory, here are its uptime records, with an astonishing availability of 99.954%!

     #               Uptime | System                
----------------------------+-----------------------
     1   112 days, 18:33:34 | Linux 4.4.0-tuned
     2   104 days, 21:00:22 | Linux 4.15.0-generic
     3    85 days, 19:08:32 | Linux 3.13.0-generic
     4    78 days, 19:04:49 | Linux 4.4.0-tuned
     5    71 days, 13:01:09 | Linux 4.13.0-lowlaten
     6    66 days, 04:42:44 | Linux 4.15.0-generic
     7    62 days, 15:49:14 | Linux 3.19.0-generic
     8    62 days, 00:52:09 | Linux 4.15.0-generic
     9    56 days, 22:21:20 | Linux 3.19.0-generic
    10    53 days, 16:34:11 | Linux 4.2.0-highmem
----------------------------+-----------------------
    up  1989 days, 03:46:34 | since Tue Oct 28 14:28:05 2014
  down     0 days, 22:00:33 | since Tue Oct 28 14:28:05 2014
   %up               99.954 | since Tue Oct 28 14:28:05 2014
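The availability figure follows directly from the up and down totals in the table: it is the uptime divided by the total observed time. A quick sketch of the arithmetic, using only the two durations shown above (the function name is my own):

```python
from datetime import timedelta

# Totals taken from the "up" and "down" rows of the table above.
up = timedelta(days=1989, hours=3, minutes=46, seconds=34)
down = timedelta(days=0, hours=22, minutes=0, seconds=33)


def availability_percent(up: timedelta, down: timedelta) -> float:
    """Share of the observed time during which the machine was up."""
    # Dividing two timedeltas yields a plain float ratio.
    return 100.0 * (up / (up + down))


print(f"{availability_percent(up, down):.3f}")  # prints 99.954
```

In other words, about 22 hours of downtime spread over roughly 1990 days.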

After updating the DNS records for all the other services, I kept checking whether any service was still being accessed through the old infrastructure. After some days of minimal activity there, I decided to destroy it.

Caveats with DNS

There are some things that I learned while doing this kind of transition. Or maybe I learned them last time but did not write them down, so I am using this space as a reminder for next time.

  • A DNS wildcard record (*.michelebologna.net) that gets resolved to a hostname (a catch-all record) can generate weird results if you are running a machine that has search michelebologna.net in its resolv.conf
  • Good hosting providers let you set the reverse DNS for every floating or static IP address of every cloud instance. The reverse DNS must match the mail server hostname (as configured in Postfix)
  • With email hosting, set up DKIM and publish SPF, DKIM, and DMARC records in the DNS
  • The root record (@) must not be a CNAME record; it must be an A/AAAA record
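To illustrate the first caveat, here is a simplified model of how a stub resolver expands names with a search list, combined with a catch-all wildcard. It is a sketch under assumptions: it mimics the common glibc-style rule (names with at least ndots dots are tried as-is first, shorter names get the search domains first), the wildcard match ignores explicitly defined records, and both function names are hypothetical.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """Return the names a resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name[:-1]]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


def matches_wildcard(name: str, zone: str) -> bool:
    """True if name would be caught by a *.zone wildcard record.

    Simplification: assumes no explicit record exists for the name.
    """
    return name.endswith("." + zone)


zone = "michelebologna.net"
for queried in ("db", "typo.example.org"):
    for candidate in candidate_names(queried, [zone]):
        caught = matches_wildcard(candidate, zone)
        print(f"{queried!r} -> {candidate!r} wildcard={caught}")
```

The second case is the surprising one: a mistyped or nonexistent external name such as typo.example.org fails as-is, is retried as typo.example.org.michelebologna.net, and the wildcard makes it resolve to the catch-all host instead of returning an error.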

2 thoughts on “How a Terraform + Salt + Kubernetes GitOps infrastructure enabled a zero downtime hosting provider switch”

    1. Yes, I run k8s on more than 3 VMs.
      The storage is defined in multiple external block storage volumes.
      Overnight scripts take care of syncing the data between the external block storage volumes.
      The database is running in k8s too, with the same strategies for storage described above.
