Site Reliability Engineer

Duffel

Posted 2 weeks ago

Join Duffel in reimagining the world of travel by building transformative tools and infrastructure.

Overview

Salary

No salary declared 😔

Location

London

Expires

Expires at any time

As a pioneer in the travel industry, Duffel is refreshing the current infrastructure and building tools that will revolutionize travel distribution, search and booking. We are backed by esteemed investors and were part of Y Combinator. Our London-based team is expanding, and we are seeking resourceful individuals to become an essential part of our growth.

Role Summary:

  • Working on systems engineering and investigating various issues.
  • Ensuring the reliability, performance, and resilience of our infrastructure.
  • Collaborating with multiple engineering teams.


Role Requirements:

  • A knack for software development and systems engineering.
  • Experience in incident response and a solid understanding of observability and reliability practices.
  • Superb communication skills and an aptitude for big picture thinking.
  • Experience with Google Cloud Platform, Infrastructure as Code, GitOps and Grafana, among others, is beneficial.


At Duffel, we are committed to your personal growth and provide a receptive environment where your ideas matter. Alongside steady career growth, you become part of our vision by owning a share of the company. We encourage applications from everyone, in line with our belief in equal opportunities.

Create the future of travel with us

Whether it’s visiting the people closest to us, starting an exciting adventure, or taking a career-defining business trip, travel is an essential part of our lives. Yet we've all experienced the aches and pains of getting to our destination. Today, more than 4 billion airline passengers rely on technology that hasn't kept up with the expectations of the modern connected traveller.

That’s why we’ve started to rebuild the infrastructure that underpins the travel industry. We’re on a mission to unravel travel — simplifying systems and building the tools that will make the future of travel effortless.

We were part of Y Combinator's S18 cohort and we are backed by Benchmark, Blossom, Index Ventures and Kima Ventures, a fantastic set of investors that has helped build some of the world's largest companies.

Our team in London is growing and we’re looking for talented people to join us on our journey.

Engineering at Duffel

We're building tools to simplify travel distribution, search and booking. What does this actually mean? It's one common and seamless API. This brings huge technical challenges, as we need to design and build a beautiful API before integrating with hundreds of airlines. Along with that, we need to navigate the differing needs and systems of each airline whilst building a fantastic developer experience to go with it.

The tools used on the team include Elixir, Phoenix, Kubernetes and Google Cloud Platform.

Site Reliability Engineering at Duffel

As an SRE at Duffel, you’ll be part of a small team within engineering that is responsible for the reliability, performance, and resilience of our infrastructure and applications. You will be working closely with engineering teams to understand their needs and help meet the demands of our product as we scale globally.

What we're looking for

- An infrastructure and systems engineering generalist who is comfortable diving deep into the weeds on different issues. Some recent examples include:

    - A configuration issue between Google’s Load Balancer and the HTTP server in our main Elixir application causing HTTP 5XX responses to be returned to our customers.

    - Debugging an issue in our OpenTelemetry pipelines causing us to silently drop spans.

- An enthusiasm for both software development and systems engineering.

- A high bar for code and configuration quality and readability.

- A good understanding of current observability and reliability practices.

- Experience in, and comfort with, running incident response.

- Big-picture thinking - you can make trade-offs between technical work streams and business impact.

- Fantastic communication skills. You're able to articulate what you're working on and why to the team in a clear and structured way.

- You thrive in a collaborative environment. You believe in your own methods but keep an open mind, taking suggestions and feedback on board as well.

Technologies

Don’t worry if your experience doesn’t exactly align with this stack; we understand that skills are transferable. This is to give you an idea of what you’ll be working with if you join the team.

- We run our infrastructure on Google Cloud Platform, so you’ll be helping to run a few of their products such as GKE, Cloud SQL for PostgreSQL, BigQuery, Memorystore (Redis) and more.

- We manage the infrastructure and security for a segregated PCI Cardholder Data Environment, entirely managed with Google Cloud Platform services and tooling.

- We follow an Infrastructure as Code approach to managing our infrastructure, using Terraform.

- We follow a GitOps approach to managing our Kubernetes configuration, using ArgoCD and Helm.

- We manage a high-availability metrics collection system using Grafana, Thanos & Prometheus. We’re in the process of transitioning to OpenTelemetry and Honeycomb for our application telemetry (traces and metrics).

- We manage a data pipeline using Pub/Sub, Airbyte, and dbt.

Our Current Focus

We’re currently driving a big shift in how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues.

We’re extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single place for engineers to understand how our applications are operating in production.

This project involves both technical work, on the application libraries and infrastructure that make up the OpenTelemetry pipeline, and an education piece, working to change perceptions and behaviours across engineering.

The Future

- We currently run all our services from a single European region in Google Cloud. In the medium term, for performance, reliability, and data residency reasons, we’ll be starting to think about how to (re)architect our applications and infrastructure to span multiple regions, operating globally.

- We deploy our application multiple times a day, but deploys are all or nothing, and when we encounter issues, rollbacks are slow. One way to address this would be to invest in CI/CD performance improvements, but we’d also like to explore alternative deployment strategies like canary releases, blue/green deployments, and traffic mirroring, and get more comfortable testing changes in production with real customer traffic.

What you can expect from us:

We're dedicated to your personal growth. Our environment is comfortable, both physically and in that our ears are always open to any ideas, concerns and questions. We believe that everyone should have pride in their work, taking full ownership of it and its impact. That's why everyone who joins Duffel owns a share of the company.

*We are an equal opportunities employer. We believe that the key to our success is employing a diverse team; that's why recruitment decisions are based only on your experience and skills. We value your ability to solve problems and build amazing things, so we welcome applications from everyone – regardless of age, sex, disability, sexual orientation, race, religion or belief.*

Note to recruitment agencies

Duffel does not accept speculative CVs from external parties. Any unsolicited CVs sent to us will be treated as property of Duffel, and any attached terms and conditions associated with these CVs will be null and void.

