Platform Engineering for the uninitiated

Chiranjib
"Oh you do platform engineering, you mean CI/CD and pipelines?"

This was my friend's response when I told him that I'm a Lead Platform Engineer now! SMH. I don't really blame him though; of late I find myself struggling to answer crisply when people ask me what I do for a living. The easy answer, of course, is that I am a developer. I have a confession though - I've been in tech for a while and have worked on software development and delivery, but when I started work as a Platform Engineer, it made me feel as if I'd been living under a rock!

I've come across quite a few people who think that DevOps (and even QA) is a luxury and not a necessity. Convincing them of the value proposition of these bedrocks is an uphill battle, but that's not what this article is about. I'm putting my thoughts down here with the intention of spreading awareness, and maybe helping a person or two who've been contemplating this line of work.

Put simply, a Platform Engineer creates a Platform, thereby offering Platform as a Service (PaaS). My first taste of this was about a decade ago, when Cloud technologies were still in the process of being adopted. My team was working on creating tooling for our Operations team, who were not developers but were solely in charge of managing deployments. The software had a few flavours - Windows or Linux VMs, SQL Server or Oracle PL/SQL databases, multiple application server stacks based on the services selected (containers were not yet a widespread thing) - and any combination of these could go on AWS or Azure. The Operations team were going crazy trying to manage deployments and upgrades, and our job was to make their jobs easier. We started out developing Bash and PowerShell scripts, but quickly pivoted to a webapp that would let one deploy any combination of the aforementioned flavours - an Internal Operations Portal, if I may call it that.

At this point, I expect you to raise a concern - AWS and Azure provide PaaS themselves, why not use it as is?

Well, you would have a valid point; but here's the thing - I've worked with a few programming languages, followed waterfall and agile methodologies, used Subversion and Git, even tried to understand Jira, and I've come to the conclusion that the most complex of them all is - people! Teams in charge of an organization's infrastructure and processes are expected to put in the effort to create instructions, knowledge bases, playbooks, et al., but getting people to read and follow them has proven to be a Herculean task - and that is just the beginning of the answer to the why.

Follow instructions

Over time, I've learned that YAGNI is not just a cool-sounding word, KISS is not just an act of love, DRY is not just the opposite of wet, and SOLID is not just a state of matter. Every developer should ideally live by these principles and master the design patterns that facilitate them. As cloud technologies keep getting more complex and diverse, cross-team functions start becoming hazy, and it gets increasingly difficult to achieve consistency and predictability in a software setup. Throw security into the mix, and conversations quickly shift away from the application code. This warranted a separation of concerns between software development and software delivery.

The solution to this problem started with setting up separate teams for the two, with operations folks managing infrastructure by hand through cloud consoles - and the term ClickOps was coined. As cloud technologies evolved, people realized that it was getting increasingly difficult to keep systems in sync given the room for human error. This naturally evolved into the adoption of script-based pipelines, and it led to the birth of DevOps. DevOps bridged the gap between development and operations quite a bit and was based on the idea of 'you build it, you run it', which inadvertently caused operations teams to shrink. It was progress alright, but it still left techies yearning for something more. Application developers were getting overloaded with the cognitive load of managing infrastructure, and it was getting in the way of product development.

Besides this, the complexity of application development was shifting from monoliths to microservices. Upholding the need for separation of concerns, technologies evolved and containerization paved the path for taking things to the next level. Docker became somewhat ubiquitous, and with the advent of containers came the need for container orchestration. Google looked over its shoulder at its internal Borg system and gave the world Kubernetes (k8s). This is where things started to get really spicy, and the cognitive load on developers quickly started to multiply.

The DevOps way had already seen the adoption of Infrastructure as Code (IaC), and tools like Terraform, Chef, Pulumi, etc. enabled us to think of our cloud resources as configuration and code. Cloud engineering needed to evolve along similar lines, and the elephant in the room was the need for standardization - enter GitOps!
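To make "cloud resources as code" concrete, here's a minimal sketch using Pulumi's Python SDK (one of the tools named above); the bucket, tags and export name are made up purely for illustration, and the same idea applies to Terraform or Chef in their own languages.

```python
# A declared S3 bucket: running `pulumi up` makes the cloud account match this code.
import pulumi
import pulumi_aws as aws

# Hypothetical bucket for CI artifacts; the name and tags are illustrative only.
artifact_bucket = aws.s3.Bucket(
    "build-artifacts",
    tags={"owner": "platform-team", "purpose": "ci-artifacts"},
)

# Export the generated bucket name so pipelines or other stacks can consume it.
pulumi.export("artifact_bucket_name", artifact_bucket.id)
```

Because the desired state lives in version control, changing a resource becomes a code review - which is exactly the mindset GitOps extends to deployments.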

Some of the brightest minds came together to set up the Cloud Native Computing Foundation (CNCF) and championed the concept of GitOps. This brought about yet another major shift in developer mindsets, allowing techies to be declarative about their infrastructure and focus solely on the what; the responsibility of the how was abstracted away by the new technologies on the horizon. With Kubernetes widely adopted for container orchestration at scale in the cloud, Helm surfaced as the package manager for deployments to k8s clusters. The packages came to be known as Charts and could be deployed with predictability and consistency. This was a step in the right direction, but it still involved a bit of ClickOps. Thanks to CNCF yet again, tools like Flux and Argo CD alleviated that pain aptly, and it became possible to manage Helm deployments in a declarative manner. As one can tell, this is already a lot to deal with for a developer who's supposed to be writing code that implements business logic.
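For contrast, this is roughly what the scripted, imperative approach to a Helm rollout looks like - a sketch only, with a made-up release name, chart path and values file. Flux and Argo CD replace exactly this kind of script with a declarative manifest in Git that the cluster continuously reconciles against.

```python
# Imperative Helm deployment wrapped in a script - the pattern GitOps tools do away with.
import subprocess

def deploy_release(release: str, chart: str, namespace: str, values_file: str) -> None:
    """Run `helm upgrade --install` so the same command works for first-time installs."""
    subprocess.run(
        [
            "helm", "upgrade", "--install", release, chart,
            "--namespace", namespace, "--create-namespace",
            "--values", values_file,
            "--atomic",  # roll back automatically if the release fails
        ],
        check=True,
    )

# Hypothetical internal service being rolled out to production
deploy_release("orders-api", "./charts/orders-api", "orders", "values-prod.yaml")
```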

Software delivery was getting code-heavy very quickly - and that is what brought about the need for Platform Engineering. It was a natural step in the evolution from DevOps, and there are multiple discussions online claiming that DevOps is dead! I wouldn't quite agree; my view is that Platform Engineering is the wealthy father who adopted the overburdened DevOps (and its good friend, Site Reliability Engineering). Let's have a closer look at this new father figure on the horizon, and understand the peculiar mix of skills required to be successful:



  • Infrastructure Operations: familiarity with the public cloud platforms and other IaaS offerings is a must; this forms the foundation
  • Software Development: there will be a need to develop services that abstract infrastructure away, and packages that extend platform tooling into the application code (a small sketch of this follows the list)
  • Product Management: nobody likes to use a bad product, and one must not forget that we are doing this in the first place to make developers' lives easier; we need to strive for the least possible friction, define the Golden Path for developers, and keep it well documented and transparent
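To give a flavour of the "services that abstract infrastructure away" bit, here's a toy sketch of a self-service endpoint that lets a team provision a Kubernetes namespace without holding cluster credentials themselves. FastAPI and the official Kubernetes Python client are my assumptions here, and the route and labels are entirely made up.

```python
# Toy platform self-service API: provisions a namespace on behalf of a team.
from fastapi import FastAPI
from kubernetes import client, config

app = FastAPI()
config.load_kube_config()  # a deployment running inside the cluster would use load_incluster_config()
k8s = client.CoreV1Api()

@app.post("/namespaces/{team}")
def provision_namespace(team: str) -> dict:
    """Create a namespace labelled with the owning team."""
    ns = client.V1Namespace(
        metadata=client.V1ObjectMeta(name=team, labels={"owner": team})
    )
    k8s.create_namespace(ns)
    return {"namespace": team, "status": "created"}
```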

With an understanding of the roles to play, let's now look at the responsibilities; I've tried to depict them below:



At the heart of everything there needs to be security, and the systems in general need to be Secure by Design. Concepts like Role-Based Access Control (RBAC) and Attribute-Based Access Control (ABAC) cannot be an afterthought; they need to be baked in from the start. Once artifacts start getting generated at scale and open source packages keep getting integrated into application code, guardrails like license scanning, Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and Software Composition Analysis (SCA) need to be in place to stop malicious code at the integration phase. Additional artifacts like an SBOM (Software Bill of Materials), test coverage reports, and scan reports need to be published for the artifacts generated during code integration.
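Since RBAC and ABAC get conflated a lot, here's a toy sketch of the difference; it is purely illustrative - real platforms lean on their IAM provider or a policy engine rather than hand-rolled checks, and the roles, teams and actions below are made up.

```python
# RBAC answers "what can this role do?"; ABAC also weighs attributes of the resource.
from dataclasses import dataclass

@dataclass
class Subject:
    user: str
    roles: set[str]
    team: str

@dataclass
class Resource:
    name: str
    owner_team: str
    environment: str  # e.g. "dev" or "prod"

def rbac_allows(subject: Subject, action: str) -> bool:
    """RBAC: the decision depends only on the subject's roles."""
    required_role = {"deploy": "deployer", "read-secrets": "admin"}
    return required_role.get(action) in subject.roles

def abac_allows(subject: Subject, resource: Resource, action: str) -> bool:
    """ABAC: the decision also considers attributes of the subject and the resource."""
    if action == "deploy":
        # Only the owning team may deploy; prod additionally requires the deployer role.
        same_team = subject.team == resource.owner_team
        prod_ok = resource.environment != "prod" or "deployer" in subject.roles
        return same_team and prod_ok
    return False

alice = Subject(user="alice", roles={"deployer"}, team="payments")
payments_api = Resource(name="payments-api", owner_team="payments", environment="prod")
print(rbac_allows(alice, "deploy"), abac_allows(alice, payments_api, "deploy"))  # True True
```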

Once all the important tech is in place to enable software delivery, the necessary augmentation is what comes next. Documentation is a non-negotiable demand of the craft, and tools like PlantUML and Mermaid have enabled maintaining Documentation as Code (Docs as Code).

Smart instrumentation for Site Reliability Engineering is of utmost importance now, because with so many moving parts in a microservices-based architecture it is otherwise near impossible to identify problems. Stacks like ELK and LGTM+, in combination with OpenTelemetry, provide respite in that area.
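For a feel of what that instrumentation looks like from the application side, here's a minimal OpenTelemetry sketch in Python; the service and span names are made up, and a real setup would export spans over OTLP to an ELK or LGTM-style backend rather than to the console.

```python
# Minimal tracing setup: each unit of work becomes a span that the backend stitches into traces.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))  # console exporter only for the sketch
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("orders-service")  # hypothetical service name

def place_order(order_id: str) -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic would go here ...

place_order("ord-42")
```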

This is of course not an exhaustive list of what already exists, and with Agentic applications on the horizon, the scope is only going to increase - think AIBOMs and AI vulnerability scanners. Other possibilities might include setting up internal MCP servers to enable small language models that power AI-native applications.

I'd like to conclude by saying - it's a lot, and it can get very overwhelming very fast, but it's important that we keep calm and resist the urge to boil the ocean on this journey.

P.S. Before all of this, we must evaluate whether the product we're building it for has an actual need for these things at scale. Some food for thought -
This post by David Heinemeier Hansson
