Self-Hosted GitHub Runners on GKE: My $800/Month Mistake That Led to a Better Solution

Guest post by Ogonna Nnamani
So, it's 3 PM on a Friday, your team is trying to push a critical hotfix, and GitHub Actions decides to put your build in a queue. For 15 minutes. Then 20 minutes. Your deployment window is closing, the stakeholders are breathing down your neck, and you're watching your GitHub Actions bill climb past $800 for the month.

That was my reality six months ago. And like most problems that keep you awake at 2 AM, it started small and innocent.

The $800 Problem That Kept Getting Worse


It began innocently enough. Our team grew from 3 to 15 developers, our deployment frequency increased, and suddenly our GitHub Actions usage exploded. What started as a manageable $100/month became $400, then $600, then crossed the dreaded $800 mark.

But the cost wasn't even the worst part. The worst part was the waiting.

During deployment rushes, builds would queue for 10-15 minutes. Developers would start their builds, then go grab coffee, chat with colleagues, or worse – start working on something else entirely, breaking their flow. Our feedback loops became molasses-slow, and productivity plummeted.

I'd sit there watching the queue, thinking: "There has to be a better way."

Spoiler alert: There was. And it involved making some spectacular mistakes along the way.

Enter Actions Runner Controller (ARC): The Light at the End of the Tunnel


After countless late nights researching alternatives, I stumbled upon Actions Runner Controller (ARC). Think of it as having a smart assistant who only hires contractors when there's work to do, then sends them home when they're done.

Traditional GitHub runners are like having a full-time employee sitting at their desk 24/7, even when there's no work. ARC creates ephemeral pods – containers that materialize when a job arrives, do their work, and vanish when complete. It's beautiful in its simplicity.

The promise was tantalizing:

  • Scale from 0 to 100+ runners instantly
  • Pay only for compute you actually use
  • Never wait in GitHub's queue again
  • Full control over the runtime environment

Of course, getting there required navigating through my usual minefield of spectacular failures.

Step 2: Installing the ARC Controller (And My First Epic Fail)

Mistake #1: The Great Firewall Fiasco


[Diagram: ARC communication flow]
My first attempt at installing ARC was... educational. I spent an entire weekend setting everything up perfectly, only to find that GitHub couldn't talk to my cluster. Webhooks were failing, runners weren't registering, and I was questioning my life choices.

The culprit? I'd forgotten to configure the firewall to allow GitHub's webhook IP ranges.

Face, meet palm.

This taught me my first crucial lesson: networking isn't an afterthought. When you're dealing with webhooks, your cluster needs to be accessible from the internet, and GitHub needs to be able to reach your ARC controller.

How ARC Actually Communicates


Here's what I learned about the communication flow the hard way:

  1. GitHub β†’ ARC Controller: GitHub sends webhook events when workflows are triggered
  2. ARC Controller β†’ Kubernetes API: Creates/deletes runner pods based on job queue
  3. Runner Pods β†’ GitHub: Self-register and poll for jobs to execute
  4. Runner Pods β†’ ARC Controller: Report status and job completion

The critical insight: GitHub initiates the conversation. Your cluster must be reachable from GitHub's servers, which means:

  • Firewall rules allowing GitHub's webhook IPs (see the sketch after this list)
  • Load balancer exposing the ARC webhook endpoint
  • Proper DNS configuration for webhook URLs
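
To make the firewall piece concrete: GitHub publishes its webhook source ranges under the "hooks" key of its meta API, and you can feed those straight into a firewall rule. A rough sketch of how I'd do it today; the rule and network names are illustrative, and the ranges change over time, so always pull them fresh:

```bash
# GitHub's current webhook source IPs live under the "hooks" key of the meta API
curl -s https://api.github.com/meta | jq -r '.hooks[]'

# Allow the IPv4 ranges through to the webhook endpoint
# (rule/network names are examples; the jq filter drops the IPv6 entries)
gcloud compute firewall-rules create allow-github-webhooks \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --source-ranges="$(curl -s https://api.github.com/meta \
    | jq -r '[.hooks[] | select(contains(":") | not)] | join(",")')"
```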

What Actually Worked


After my networking debacle, here's the proper setup:

Certificate Manager First
I installed cert-manager to handle SSL certificates automatically. This ensures secure communication between GitHub and our cluster - because webhooks over HTTP are a security nightmare.
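
If you're following along, cert-manager installs cleanly from its official Helm chart. Roughly what I mean (release and namespace are the usual defaults; adjust to taste):

```bash
# Install cert-manager, including its CRDs, from the official Jetstack chart
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true   # newer chart versions prefer crds.enabled=true
```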

ARC Controller Installation
The ARC controller gets installed in its own namespace (arc-systems) and acts as the orchestrator. It's essentially a Kubernetes operator that watches GitHub webhook events and translates them into pod creation/deletion actions.
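
For reference, the controller itself is a single Helm release. The OCI chart path below is the one GitHub documents for the runner scale set flavor of ARC at the time of writing; if you installed ARC a different way, yours will differ:

```bash
# Install the ARC controller into its own namespace
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```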

Testing Connectivity
Before proceeding, I learned to test the webhook endpoint thoroughly (a few of these checks are sketched after the list):

  • Verify external IP accessibility
  • Test SSL certificate validity
  • Confirm GitHub can reach the webhook URL
  • Monitor webhook delivery logs
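
A few quick commands cover most of that list. The hostname is a placeholder for your own webhook endpoint, and the GitHub-side delivery logs live under the webhook's "Recent Deliveries" tab in the repo or org settings:

```bash
# Does the webhook hostname resolve to the load balancer's external IP?
dig +short arc-webhook.example.com

# Is the endpoint reachable over TLS with a valid certificate? (-v prints the cert details)
curl -v https://arc-webhook.example.com/ -o /dev/null

# What does the cluster side think? Find the controller pod and tail its logs.
kubectl get pods -n arc-systems
kubectl logs -n arc-systems <controller-pod-name> --tail=100
```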

Building the Foundation: GKE Cluster Setup


Once I'd conquered the networking nightmare, I focused on building a solid foundation. The key decisions that made or broke the implementation:

The Spot Instance Gamble


I made a bold choice: run everything on Google Cloud Spot instances. These preemptible nodes can disappear with just 30 seconds' notice, but they cost 60-70% less than regular instances.

"This will either be brilliant or catastrophic," I thought.

Turns out, it was brilliant. ARC handles preemptions gracefully – if a node gets terminated, pods simply reschedule elsewhere. The cost savings were immediate and substantial.
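
Creating the Spot node pool is a one-liner; the machine type and autoscaling bounds below are examples, not a recommendation:

```bash
# Dedicated Spot node pool for runner pods, scaling down to zero nodes when idle
gcloud container node-pools create runner-spot-pool \
  --cluster=ci-cluster \
  --region=us-central1 \
  --spot \
  --machine-type=e2-standard-8 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=20
```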

Storage: Where I Learned About Shared State


Initially, I ignored storage entirely. "How hard can it be?" I thought.

Very hard, as it turns out.

Without shared storage for build caches, every job started from scratch. Build times were actually slower than on GitHub's hosted runners. My "optimization" had made things worse.

The solution involved integrating:

  • Google Cloud Filestore: 250GB shared volume for build caches
  • Smart cache organization: Structured by repository and branch

Suddenly, build times dropped by 40%. Sometimes the obvious solutions are obvious for a reason.
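
For the curious, the shared cache boils down to a ReadWriteMany volume that every runner pod mounts. A sketch of the claim, assuming the Filestore CSI driver is enabled on the cluster; names are illustrative, and Filestore tiers have their own minimum sizes, so the provisioned capacity may end up larger than requested:

```yaml
# PersistentVolumeClaim for the shared build cache, backed by Filestore
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-build-cache
  namespace: arc-runners
spec:
  accessModes:
    - ReadWriteMany                  # many runner pods mount the same cache
  storageClassName: standard-rwx     # GKE's Filestore-backed storage class
  resources:
    requests:
      storage: 250Gi
```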

Step 3: Building Custom Images (Learning Docker the Hard Way)

Mistake #2: The 8GB Image Monster


My first custom runner image was... ambitious. I threw everything I could think of into it: multiple Node.js versions, Python 2 and 3, every CLI tool I'd ever heard of, and enough packages to power a small data center.

The result? An 8GB monster that took 15 minutes to pull on each pod creation.

Watching developers wait 15 minutes just to start their build was painful. I quickly learned that in the world of ephemeral pods, image size directly impacts developer happiness.

The Right Approach to Custom Images


After several iterations, I developed a strategy:

[Diagram: custom Docker image strategy]
Base Image Philosophy

  • ubuntu-22-04: Lean base with Node.js, Python, and essential tools only
  • ubuntu-22-04-infra: Infrastructure-focused with Terraform, kubectl, and cloud CLIs
  • ubuntu-22-04-qa: Testing-focused with Selenium, browsers, and test frameworks

Size Optimization Lessons

  • Multi-stage builds to eliminate build dependencies
  • Careful package selection (do you really need that 500MB SDK?)
  • Layer optimization to maximize Docker cache hits
  • Regular cleanup of apt caches and temp files (see the Dockerfile sketch below)
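
Here's a minimal sketch of the lean-base idea, extending GitHub's runner base image; the package list is illustrative, not my exact image:

```dockerfile
# Lean runner image: essentials only
FROM ghcr.io/actions/actions-runner:latest

USER root
# Install only what most builds need, then clean apt metadata in the same
# layer so it never ends up baked into the final image.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        git curl ca-certificates build-essential \
        python3 python3-pip nodejs npm && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
USER runner
```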

The Sweet Spot
My optimized images clock in at 1.5-2GB and pull in under 60 seconds. The difference in developer experience is night and day.

The results were dramatic: job setup time dropped from 3-4 minutes to under 30 seconds.

Step 4: Configuring Ephemeral Runners (Pod Lifecycle Mysteries)

Understanding the Magic


Each runner pod is ephemeral, meaning it has a complete lifecycle:

  1. Creation: ARC sees a queued job and creates a pod
  2. Registration: Pod starts up and registers with GitHub
  3. Execution: Receives and executes the workflow job
  4. Cleanup: Job completes, pod reports back, and gets deleted
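
You can watch this lifecycle play out from the cluster side. Assuming the runner scale set was installed into an arc-runners namespace (adjust to yours):

```bash
# Runner pods appear when jobs queue up and vanish when the jobs finish
kubectl get pods -n arc-runners --watch
```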

Container Architecture


Each runner pod actually runs two containers:

  • Main runner container: Executes the GitHub Actions workflows
  • Docker-in-Docker (DinD) sidecar: Handles container builds securely

This architecture provides isolation while enabling Docker builds – crucial for most modern CI/CD workflows.

Mistake #3: The Docker-in-Docker Discovery


My next challenge was handling Docker builds within the runners. My initial approach was to mount the Docker socket from the host into the pods.

This worked beautifully in testing. In production? Not so much.

Security-wise, it was equivalent to giving every job root access to the host. One badly configured job could potentially compromise the entire node.

The better approach: Docker-in-Docker (DinD). This provided isolation while enabling Docker builds. No more security nightmares, no more compromised nodes.
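
With the newer ARC Helm charts, DinD isn't something you wire up by hand; it's a documented switch in the scale set's values. A fragment, assuming the gha-runner-scale-set chart:

```yaml
# values.yaml fragment: run a Docker-in-Docker sidecar next to each runner container
containerMode:
  type: "dind"
```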

Mistake #4: The Resource Allocation Disaster


Confident in my progress, I deployed to production with minimal resource limits. "Let Kubernetes figure it out," I thought.

Bad idea.

Jobs started failing mysteriously. Pods were getting OOMKilled. The cluster was thrashing under memory pressure. I'd created a resource contention nightmare.

The solution required careful tuning:

  • CPU: 1 core request, 4 core limit per runner
  • Memory: 1GB request, 2GB limit
  • Storage: Shared cache access for all runners

Pro tip: Always set resource requests and limits. Kubernetes is smart, but it's not psychic.
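
Expressed as a fragment of the scale set's pod template override (again assuming the gha-runner-scale-set chart; the image name is a placeholder for one of the custom images above), those numbers look something like this:

```yaml
# Runner container resources in the scale set values
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/my-org/ubuntu-22-04:latest   # placeholder custom image
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
          limits:
            cpu: "4"
            memory: 2Gi
```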

The Scaling Sweet Spot


After months of tuning, I found our ideal scaling configuration:

  • Minimum runners: 1 (always ready for immediate pickup)
  • Maximum runners: 100 (handles our largest deployment batches)
  • Scale-to-zero: Pods disappear when not needed

This gave us the best of both worlds: instant job pickup for small changes, and massive parallel capacity for large deployments.
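
In chart values, that whole scaling policy is just a handful of lines (the URL and secret name are placeholders, assuming the gha-runner-scale-set chart):

```yaml
# values.yaml for a runner scale set
githubConfigUrl: "https://github.com/my-org"     # org-level runners
githubConfigSecret: "github-app-credentials"     # Kubernetes secret with the GitHub App creds
minRunners: 1      # one warm runner for instant pickup
maxRunners: 100    # ceiling for the largest deployment batches
```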

The Results: When Everything Finally Clicks


Six months later, the transformation has been remarkable:

Cost Victory 💰

  • Monthly CI/CD costs: $800+ β†’ $200
  • 70% reduction in infrastructure spend
  • Predictable costs with no surprise overages

Performance Revolution 🚀

  • Job queue time: 15 minutes β†’ 30 seconds
  • Build speed: 40% faster due to effective caching
  • Deployment reliability: Near-zero failures due to resource constraints

Developer Happiness 😊


The real win? Developers stopped complaining about builds. Feedback loops became fast again. People could stay in flow instead of context-switching while waiting for deployments.

The Antifragile System


What we built isn't just cost-effective – it's antifragile. When traffic spikes hit, it scales up. When nodes get preempted, pods reschedule. When builds fail, we have granular logs to diagnose issues quickly.

Each failure along the way taught us something valuable:

  • The firewall issue taught us to plan networking carefully
  • The storage problems taught us the importance of shared state
  • The resource disasters taught us the value of proper limits

Every broken deployment made the system stronger.

Should You Make the Jump?


If you're spending $500+ monthly on GitHub Actions and dealing with queue times, self-hosted runners on GKE might be your answer. But go in with realistic expectations:

  • Initial setup time: 2-3 weeks for a robust implementation
  • Learning curve: Steep if you're new to Kubernetes
  • Ongoing maintenance: You're now responsible for infrastructure
  • Cost savings: Significant, but not immediate

Start simple. Get basic runners working, then add complexity gradually. Monitor everything. And don't be afraid to fail – each failure teaches you something you couldn't learn any other way.

The Bottom Line


Six months ago, I was paying $800/month to wait in GitHub's queue. Today, I'm paying $200/month for instant deployments and custom runtime environments.

Sometimes the best solutions come from the problems that annoy you most.

If you enjoyed reading this, connect with me on LinkedIn.


Building your own CI/CD infrastructure? I'd love to hear about your journey and the spectacular failures along the way. They make the best stories.

Follow me for more CI/CD war stories and Kubernetes adventures!
