Self-Hosted GitHub Runners on GKE: My $800/Month Mistake That Led to a Better Solution

Guest post by Ogonna Nnamani
So, it's 3 PM on a Friday, your team is trying to push a critical hotfix, and GitHub Actions decides to put your build in a queue. For 15 minutes. Then 20 minutes. Your deployment window is closing, the stakeholders are breathing down your neck, and you're watching your GitHub Actions bill climb past $800 for the month.

That was my reality six months ago. And like most problems that keep you awake at 2 AM, it started small and innocent.

The $800 Problem That Kept Getting Worse


It began innocently enough. Our team grew from 3 to 15 developers, our deployment frequency increased, and suddenly our GitHub Actions usage exploded. What started as a manageable $100/month became $400, then $600, then crossed the dreaded $800 mark.

But the cost wasn't even the worst part. The worst part was the waiting.

During deployment rushes, builds would queue for 10-15 minutes. Developers would start their builds, then go grab coffee, chat with colleagues, or worse – start working on something else entirely, breaking their flow. Our feedback loops became molasses-slow, and productivity plummeted.

I'd sit there watching the queue, thinking: "There has to be a better way."

Spoiler alert: There was. And it involved making some spectacular mistakes along the way.

Enter Actions Runner Controller (ARC): The Light at the End of the Tunnel


After countless late nights researching alternatives, I stumbled upon Actions Runner Controller (ARC). Think of it as having a smart assistant who only hires contractors when there's work to do, then sends them home when they're done.

Traditional GitHub runners are like having a full-time employee sitting at their desk 24/7, even when there's no work. ARC creates ephemeral pods – containers that materialize when a job arrives, do their work, and vanish when complete. It's beautiful in its simplicity.

The promise was tantalizing:

  • Scale from 0 to 100+ runners instantly
  • Pay only for compute you actually use
  • Never wait in GitHub's queue again
  • Full control over the runtime environment

Of course, getting there required navigating through my usual minefield of spectacular failures.

Step 2: Installing the ARC Controller (And My First Epic Fail)

Mistake #1: The Great Firewall Fiasco


[Diagram: ARC communication flow]
My first attempt at installing ARC was... educational. I spent an entire weekend setting everything up perfectly, only to find that GitHub couldn't talk to my cluster. Webhooks were failing, runners weren't registering, and I was questioning my life choices.

The culprit? I'd forgotten to configure the firewall to allow GitHub's webhook IP ranges.

Face, meet palm.

This taught me my first crucial lesson: networking isn't an afterthought. When you're dealing with webhooks, your cluster needs to be accessible from the internet, and GitHub needs to be able to reach your ARC controller.

How ARC Actually Communicates


Here's what I learned about the communication flow the hard way:

  1. GitHub β†’ ARC Controller: GitHub sends webhook events when workflows are triggered
  2. ARC Controller β†’ Kubernetes API: Creates/deletes runner pods based on job queue
  3. Runner Pods β†’ GitHub: Self-register and poll for jobs to execute
  4. Runner Pods β†’ ARC Controller: Report status and job completion

The critical insight: GitHub initiates the conversation. Your cluster must be reachable from GitHub's servers, which means:

  • Firewall rules allowing GitHub's webhook IPs (see the sketch after this list)
  • Load balancer exposing the ARC webhook endpoint
  • Proper DNS configuration for webhook URLs
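
To make the firewall piece concrete: GitHub publishes its webhook source ranges under the "hooks" key of its meta API, and you can feed those straight into a firewall rule. A rough sketch of how I'd do it today; the rule and network names are illustrative, and the ranges change over time, so always pull them fresh:

```bash
# GitHub's current webhook source IPs live under the "hooks" key of the meta API
curl -s https://api.github.com/meta | jq -r '.hooks[]'

# Allow the IPv4 ranges through to the webhook endpoint
# (rule/network names are examples; the jq filter drops the IPv6 entries)
gcloud compute firewall-rules create allow-github-webhooks \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --source-ranges="$(curl -s https://api.github.com/meta \
    | jq -r '[.hooks[] | select(contains(":") | not)] | join(",")')"
```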

What Actually Worked


After my networking debacle, here's the proper setup:

Certificate Manager First
I installed cert-manager to handle SSL certificates automatically. This ensures secure communication between GitHub and our cluster - because webhooks over HTTP are a security nightmare.
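
If you're following along, cert-manager installs cleanly from its official Helm chart. Roughly what I mean (release and namespace are the usual defaults; adjust to taste):

```bash
# Install cert-manager, including its CRDs, from the official Jetstack chart
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true   # newer chart versions prefer crds.enabled=true
```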

ARC Controller Installation
The ARC controller gets installed in its own namespace (arc-systems) and acts as the orchestrator. It's essentially a Kubernetes operator that watches GitHub webhook events and translates them into pod creation/deletion actions.
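
For reference, the controller itself is a single Helm release. The OCI chart path below is the one GitHub documents for the runner scale set flavor of ARC at the time of writing; if you installed ARC a different way, yours will differ:

```bash
# Install the ARC controller into its own namespace
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```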

Testing Connectivity
Before proceeding, I learned to test the webhook endpoint thoroughly (a few of these checks are sketched after the list):

  • Verify external IP accessibility
  • Test SSL certificate validity
  • Confirm GitHub can reach the webhook URL
  • Monitor webhook delivery logs
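
A few quick commands cover most of that list. The hostname is a placeholder for your own webhook endpoint, and the GitHub-side delivery logs live under the webhook's "Recent Deliveries" tab in the repo or org settings:

```bash
# Does the webhook hostname resolve to the load balancer's external IP?
dig +short arc-webhook.example.com

# Is the endpoint reachable over TLS with a valid certificate? (-v prints the cert details)
curl -v https://arc-webhook.example.com/ -o /dev/null

# What does the cluster side think? Find the controller pod and tail its logs.
kubectl get pods -n arc-systems
kubectl logs -n arc-systems <controller-pod-name> --tail=100
```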

Building the Foundation: GKE Cluster Setup


Once I'd conquered the networking nightmare, I focused on building a solid foundation. The key decisions that made or broke the implementation:

The Spot Instance Gamble


I made a bold choice: run everything on Google Cloud Spot instances. These preemptible nodes can disappear with just 30 seconds' notice, but they cost 60-70% less than regular instances.

"This will either be brilliant or catastrophic," I thought.

Turns out, it was brilliant. ARC handles preemptions gracefully – if a node gets terminated, pods simply reschedule elsewhere. The cost savings were immediate and substantial.
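
Creating the Spot node pool is a one-liner; the machine type and autoscaling bounds below are examples, not a recommendation:

```bash
# Dedicated Spot node pool for runner pods, scaling down to zero nodes when idle
gcloud container node-pools create runner-spot-pool \
  --cluster=ci-cluster \
  --region=us-central1 \
  --spot \
  --machine-type=e2-standard-8 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=20
```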

Storage: Where I Learned About Shared State


Initially, I ignored storage entirely. "How hard can it be?" I thought.

Very hard, as it turns out.

Without shared storage for build caches, every job started from scratch. Build times were actually slower than on GitHub's hosted runners. My "optimization" had made things worse.

The solution involved integrating:

  • Google Cloud Filestore: 250GB shared volume for build caches
  • Smart cache organization: Structured by repository and branch

Suddenly, build times dropped by 40%. Sometimes the obvious solutions are obvious for a reason.
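
For the curious, the shared cache boils down to a ReadWriteMany volume that every runner pod mounts. A sketch of the claim, assuming the Filestore CSI driver is enabled on the cluster; names are illustrative, and Filestore tiers have their own minimum sizes, so the provisioned capacity may end up larger than requested:

```yaml
# PersistentVolumeClaim for the shared build cache, backed by Filestore
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-build-cache
  namespace: arc-runners
spec:
  accessModes:
    - ReadWriteMany                  # many runner pods mount the same cache
  storageClassName: standard-rwx     # GKE's Filestore-backed storage class
  resources:
    requests:
      storage: 250Gi
```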

Step 3: Building Custom Images (Learning Docker the Hard Way)

Mistake #2: The 8GB Image Monster


My first custom runner image was... ambitious. I threw everything I could think of into it: multiple Node.js versions, Python 2 and 3, every CLI tool I'd ever heard of, and enough packages to power a small data center.

The result? An 8GB monster that took 15 minutes to pull on each pod creation.

Watching developers wait 15 minutes just to start their build was painful. I quickly learned that in the world of ephemeral pods, image size directly impacts developer happiness.

The Right Approach to Custom Images


After several iterations, I developed a strategy:

[Diagram: custom Docker image strategy]
Base Image Philosophy

  • ubuntu-22-04: Lean base with Node.js, Python, and essential tools only
  • ubuntu-22-04-infra: Infrastructure-focused with Terraform, kubectl, and cloud CLIs
  • ubuntu-22-04-qa: Testing-focused with Selenium, browsers, and test frameworks

Size Optimization Lessons

  • Multi-stage builds to eliminate build dependencies
  • Careful package selection (do you really need that 500MB SDK?)
  • Layer optimization to maximize Docker cache hits
  • Regular cleanup of apt caches and temp files (see the Dockerfile sketch below)
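
Here's a minimal sketch of the lean-base idea, extending GitHub's runner base image; the package list is illustrative, not my exact image:

```dockerfile
# Lean runner image: essentials only
FROM ghcr.io/actions/actions-runner:latest

USER root
# Install only what most builds need, then clean apt metadata in the same
# layer so it never ends up baked into the final image.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        git curl ca-certificates build-essential \
        python3 python3-pip nodejs npm && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
USER runner
```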

The Sweet Spot
My optimized images clock in at 1.5-2GB and pull in under 60 seconds. The difference in developer experience is night and day.

The results were dramatic: job setup time dropped from 3-4 minutes to under 30 seconds.

Step 4: Configuring Ephemeral Runners (Pod Lifecycle Mysteries)

Understanding the Magic


Each runner pod is ephemeral, meaning it has a complete lifecycle:

  1. Creation: ARC sees a queued job and creates a pod
  2. Registration: Pod starts up and registers with GitHub
  3. Execution: Receives and executes the workflow job
  4. Cleanup: Job completes, pod reports back, and gets deleted
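
You can watch this lifecycle play out from the cluster side. Assuming the runner scale set was installed into an arc-runners namespace (adjust to yours):

```bash
# Runner pods appear when jobs queue up and vanish when the jobs finish
kubectl get pods -n arc-runners --watch
```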

Container Architecture


Each runner pod actually runs two containers:

  • Main runner container: Executes the GitHub Actions workflows
  • Docker-in-Docker (DinD) sidecar: Handles container builds securely

This architecture provides isolation while enabling Docker builds – crucial for most modern CI/CD workflows.

Mistake #3: The Docker-in-Docker Discovery


My next challenge was handling Docker builds within the runners. My initial approach was to mount the Docker socket from the host into the pods.

This worked beautifully in testing. In production? Not so much.

Security-wise, it was equivalent to giving every job root access to the host. One badly configured job could potentially compromise the entire node.

The better approach: Docker-in-Docker (DinD). This provided isolation while enabling Docker builds. No more security nightmares, no more compromised nodes.
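
With the newer ARC Helm charts, DinD isn't something you wire up by hand; it's a documented switch in the scale set's values. A fragment, assuming the gha-runner-scale-set chart:

```yaml
# values.yaml fragment: run a Docker-in-Docker sidecar next to each runner container
containerMode:
  type: "dind"
```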

Mistake #4: The Resource Allocation Disaster


Confident in my progress, I deployed to production with minimal resource limits. "Let Kubernetes figure it out," I thought.

Bad idea.

Jobs started failing mysteriously. Pods were getting OOMKilled. The cluster was thrashing under memory pressure. I'd created a resource contention nightmare.

The solution required careful tuning:

  • CPU: 1 core request, 4 core limit per runner
  • Memory: 1GB request, 2GB limit
  • Storage: Shared cache access for all runners

Pro tip: Always set resource requests and limits. Kubernetes is smart, but it's not psychic.
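
Expressed as a fragment of the scale set's pod template override (again assuming the gha-runner-scale-set chart; the image name is a placeholder for one of the custom images above), those numbers look something like this:

```yaml
# Runner container resources in the scale set values
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/my-org/ubuntu-22-04:latest   # placeholder custom image
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
          limits:
            cpu: "4"
            memory: 2Gi
```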

The Scaling Sweet Spot


After months of tuning, I found our ideal scaling configuration:

  • Minimum runners: 1 (always ready for immediate pickup)
  • Maximum runners: 100 (handles our largest deployment batches)
  • Scale-to-zero: Pods disappear when not needed

This gave us the best of both worlds: instant job pickup for small changes, and massive parallel capacity for large deployments.
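
In chart values, that whole scaling policy is just a handful of lines (the URL and secret name are placeholders, assuming the gha-runner-scale-set chart):

```yaml
# values.yaml for a runner scale set
githubConfigUrl: "https://github.com/my-org"     # org-level runners
githubConfigSecret: "github-app-credentials"     # Kubernetes secret with the GitHub App creds
minRunners: 1      # one warm runner for instant pickup
maxRunners: 100    # ceiling for the largest deployment batches
```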

The Results: When Everything Finally Clicks


Six months later, the transformation has been remarkable:

Cost Victory 💰

  • Monthly CI/CD costs: $800+ β†’ $200
  • 70% reduction in infrastructure spend
  • Predictable costs with no surprise overages

Performance Revolution 🚀

  • Job queue time: 15 minutes β†’ 30 seconds
  • Build speed: 40% faster due to effective caching
  • Deployment reliability: Near-zero failures due to resource constraints

Developer Happiness 😊


The real win? Developers stopped complaining about builds. Feedback loops became fast again. People could stay in flow instead of context-switching while waiting for deployments.

The Antifragile System


What we built isn't just cost-effective – it's antifragile. When traffic spikes hit, it scales up. When nodes get preempted, pods reschedule. When builds fail, we have granular logs to diagnose issues quickly.

Each failure along the way taught us something valuable:

  • The firewall issue taught us to plan networking carefully
  • The storage problems taught us the importance of shared state
  • The resource disasters taught us the value of proper limits

Every broken deployment made the system stronger.

Should You Make the Jump?


If you're spending $500+ monthly on GitHub Actions and dealing with queue times, self-hosted runners on GKE might be your answer. But go in with realistic expectations:

  • Initial setup time: 2-3 weeks for a robust implementation
  • Learning curve: Steep if you're new to Kubernetes
  • Ongoing maintenance: You're now responsible for infrastructure
  • Cost savings: Significant, but not immediate

Start simple. Get basic runners working, then add complexity gradually. Monitor everything. And don't be afraid to fail – each failure teaches you something you couldn't learn any other way.

The Bottom Line


Six months ago, I was paying $800/month to wait in GitHub's queue. Today, I'm paying $200/month for instant deployments and custom runtime environments.

Sometimes the best solutions come from the problems that annoy you most.

If you enjoyed reading this, connect with me on LinkedIn.


Building your own CI/CD infrastructure? I'd love to hear about your journey and the spectacular failures along the way. They make the best stories.

Follow me for more CI/CD war stories and Kubernetes adventures!
