The Cloud Gambit

A GKE Party with Nick Eberts

William Collins Episode 44



What happens when you get Eyvonne, William, and our special guest Nick Eberts in the same conversation? You get a GKE party! In this episode, we dive deep into the world of multi-cluster Kubernetes management with Nick Eberts, Product Manager for GKE Fleets & Teams at Google. Nick shares his expertise on platform engineering, the evolution from traditional infrastructure to cloud-native platforms, and the challenges of managing multiple Kubernetes clusters at scale. We explore the parallels between enterprise architecture and modern platform teams, discuss the future of multi-cluster orchestration, and unpack Google's innovative work with Spanner database integration for GKE. Nick also shares his passion for contributing to open source through SIG Multi-Cluster and provides valuable guidance for those interested in getting involved with the Kubernetes community.


Where to Find Nick Eberts

  • LinkedIn: https://www.linkedin.com/in/nicholaseberts
  • Twitter: https://twitter.com/nicholaseberts
  • Bluesky: @nickeberts.dev


Show Links

  • SIG Multi-Cluster: https://github.com/kubernetes/community/tree/master/sig-multicluster
  • Google Kubernetes Engine (GKE): https://cloud.google.com/kubernetes-engine
  • Spanner Database: https://cloud.google.com/spanner
  • Kubernetes: https://kubernetes.io/
  • KubeCon: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/
  • Argo CD: https://argoproj.github.io/cd
  • Flux: https://fluxcd.io/
  • CNCF: https://www.cncf.io/


Follow, Like, and Subscribe!

  • Podcast: https://www.thecloudgambit.com/
  • YouTube: https://www.youtube.com/@TheCloudGambit
  • LinkedIn: https://www.linkedin.com/company/thecloudgambit
  • Twitter: https://twitter.com/TheCloudGambit
  • TikTok: https://www.tiktok.com/@thecloudgambit
Nick:

What I'm really passionate about is making sure the things that we're building for multi-cluster and GKE land upstream in open source, helping the team deliver some tools in SIG Multicluster, which is part of upstream Kubernetes, right. And the purview of SIG Multicluster is to start to help build the primitives for multi-cluster toolchains across upstream Kubernetes, which we then use in the managed Kubernetes world.

William:

Welcome back, fellow nerds. I'm your host, William, coming to you from the depths of Mount Doom, or let's just say Mount YAML. I'm going Lord of the Rings here: instead of the Rings of Power, we forge manifests of infrastructure. And with me is my companion and co-host on this quest, Eyvonne Sharp, who, like Gandalf, guides others through the treacherous paths and perils of operationalizing multi-cluster Kubernetes and such. How are you doing?

Eyvonne:

You should have asked me which character I wanted to be, because to me, the real hero of the story is Samwise Gamgee. So if you're ever going to pick a character for me, I want to be Sam.

William:

Anyway, you know, I said Harry Potter and I pivoted to Lord of the Rings at a moment's notice, because you said you hadn't read the Harry Potter books. You see how bad that pivot was?

Eyvonne:

I'm super culturally behind. Like, my husband and I just started watching Peaky Blinders. You know, I watched Lost five or six years after it went off the air. So I get there eventually. I just have to have tons of evidence that it's worth my time.

Nick:

There is no evidence in Lost. No, yeah, okay, that's true. And, a little bit appropriately, the time that you spent watching it would be...

Eyvonne:

Lost. Yes, and that other voice that you hear is my friend, peer and colleague, Nick Eberts. Nick, welcome.

Nick:

Thanks for having me on. Good to meet you, William.

Eyvonne:

If you don't know Nick, he is a product manager for GKE Fleets and Teams at Google Cloud. And Nick, why don't you tell us a little bit about who you are and what you do?

Nick:

Yeah, I'm a human being, allegedly.

William:

So you're not AI? No, you could be AI and we would never know right now.

Nick:

No, I have the appropriate number of phalanges and such. But yeah, I've been in this computer business for too long. I started with the Navy back in 2002, and I've had this weird progression: I had degrees in geophysics and somehow I'm still on computers. Whatever. I spent some time, you know, moving up through the back-end channels of what they now call platform engineering. I was a sysadmin doing Linux stuff way back in the day, and then worked my way through a bunch of jobs. Now here I am at

Nick:

Google, working as a product manager. Actually, Eyvonne and I met when we used to be on the same team, and I'm sure everyone who's listening knows what team Eyvonne is on. She's on a special team, but...

Eyvonne:

But yeah, that is absolutely right. The team that has had, you know, three different names and several different iterations since then, but is still doing largely the same thing: helping customers adopt Google Cloud.

William:

That's what we do. So it's really nice seeing sales and product people get along so nicely. I love it.

Eyvonne:

We try. You know, there's some of what we would call healthy conflict and banter and camaraderie, and I think we need a bit of tension between sales and product to get to the right place. We've all been there, and I suspect we've all been on both sides of that conversation.

Nick:

Yeah, well, just to take a moment to acknowledge that role. Right, it's not just sales, it's being sort of tip of the spear, the highest escalation point of sales engineering. I used to do that job, so I totally respect it. But as a product manager, there is nothing more valuable to me than someone who does that job, talks to hundreds of customers, and uses the stuff. It's an invaluable source of information and feedback.

Eyvonne:

It warms my heart to hear you say that, Nick.

William:

Yeah, so you mentioned the platform engineering words. You know, I was reading something the other day, I think it was Gartner, saying some high number, like 80 or 90 percent, of software engineering orgs are going to have a team dedicated to platform engineering by like 2026 or 2027, which is pretty wild. So what is this platform engineering? Isn't it just DevOps? What's the difference?

Nick:

So, like, how far down the turtles are we going to go? Because, like, what is DevOps? No, but DevOps, I think we all know, is a philosophy, right? It's a mode of operations, it's not a bunch of tools. You can't buy a DevOps, even though plenty of companies would probably sell you one if you wanted one.

Nick:

So I think, we've all been professionals in this industry for a while, right? The arc of platform engineering, I think, starts with, I don't know, racking and stacking hardware and trying to serve the needs of the business in data centers that we own. Then the cloud came out and we said, oh hey, we really shouldn't do that, because then my job's not relevant. So we fought against it for a while. But then a bunch of people went out there, shadow IT, to Amazon and Azure and Google, and when they did, they kind of made these teams of folks who understood the API surface of those clouds, right? So they could write a bunch of automation and they could do things atomically. So the developer and this engineer, whatever you want to call them, DevOps engineer, work together to quickly iterate and release software without having to put a ticket into the infrastructure team. Turns out that's cool for a little while.

Nick:

But you know, these enterprise organizations are huge. They have piles of different applications and teams. I mean, some of these companies we work with are literally like 15 companies. You know what I mean? I think there's a company local to us here in Georgia that actually has five CIOs, right, because they're that big. And so then comes the question of efficiency and governance and control. You take these people going fast, sort of inefficiently, out on the edge, and you're like, okay, how do we allow them to keep going fast but get some consistency and some economies of scale? I think that's what platform engineering is born out of. I can tell you one thing: when we write these product docs at Google, one of the things I like to start with is, what isn't this thing? And this is the controversial bit: platform engineering is not a UI, it's not just an IDP. And, well, what does IDP even mean?

William:

Yeah, it's definitely not a UI or even a single service. I would say it's a conglomeration of different services that you're publishing for consumption by developers from some sort of centralized self-service platform.

Nick:

The acronym IDP, though, is important here. I'd love to clear this up: it's not internal developer portal, it's internal developer platform. And a UI is great. Because what is a platform? It's a product, right? So what do you do when you build products? You interview your users, find out what they want, take those things, stack-rank them based on priority, and then build them. So if a lot of your users want a UI and that would make their life better, then build it.

William:

I want to clarify something. You sort of said the DevOps was happening kind of on the edge, kind of the hey, we're going fast, we have these automations, we're doing all these things, but it's kind of disjointed. Are you saying the evolution of that, done the right way in the context of a big organization, is what platform engineering is? It's almost like a progression: taking those practices and guiding principles and making them consumable for a large organization.

Nick:

Yeah, that's great. You should market that; maybe you could sell something. No, that's 100% it. A large section of the book is basically about why it's important, and it's much more related to org charts and organizing human beings into pockets to deliver the right amount of value to each other than it is about, again, tech.

Eyvonne:

It's never really about tech directly, I think. Well, and I think the place where we find ourselves, for those of us who grew up racking and stacking physical servers and installing... yeah, that's all of us, right? We are the elder millennial and Gen X folks in tech. There were physical things in the physical world that helped us structure our organizations, right? Like, we need a team that racks this server, puts in the rack screws, cables it to physical switches. And I think the place where we find ourselves now is that the systems aren't physical; they don't exist in the physical world. So we need a new way to map our organizational structures to the work.

Eyvonne:

And it's not always as clear, because there aren't physical things. I mean, if you go down far enough there are physical things, but most of us operating in cloud aren't operating on physical things. So we need a new way to think about how we're going to structure our orgs and how we're going to structure our systems. And all of that is somewhat ephemeral because we can't put our hands on it. So to me, platform engineering is about, okay, how do we develop those systems and structures and interaction surfaces for our people and the technology in a way that makes sense, when there's nothing for me to physically look at to understand what it is? I feel like that's sort of the problem we're trying to solve these days in the broader industry.

Nick:

Yeah, for sure. I think it's this mechanism that we have to help promote DevOps principles across a bunch of smaller app teams, right? You want to give them self-service, you want to get out of their way, but you also don't want your business to end up in the paper because you got hacked, because blah, blah, blah, some secret and whatever governance guardrail didn't get put up.

Eyvonne:

So it's that, or HIPAA compliance, or PCI, or any of those. And we have the added complication that a lot of those regulations and auditing processes were built and understood in an old framework that doesn't necessarily map to how we deploy technology today. So that's the other challenge. A colleague of mine shared this great quote, I believe it's Churchill: first we make our buildings, and then they make us. And I feel like that's where we are. We made this system of managing infrastructure, and now it's shaped the next generation, even though it doesn't exactly fit. So yeah, fun, fun.

William:

I almost feel like, and I'm curious to get your thoughts here, Nick, this shifts the responsibility from all these teams back to a centralized platform team. Essentially, the responsibility of managing infrastructure complexities lies with the platform team, so that the devs can just focus on building their applications. And with that being said, it almost seems like a drop-in replacement for, I don't want to say enterprise architecture, but kind of a new wave of enterprise-architecture-y type thing. Because usually enterprise architecture was decoupled from the boots on the ground; they're in the ivory tower, writing these things to shape how the business is going to run technology. Maybe those things are adhered to, maybe not; maybe stuff was never updated; maybe they're on the ball, maybe they're not. But the platform team seems to take that responsibility for the architecture and marry it with the execution and the consumption as well.

Nick:

Yeah, I mean, from my field days, field engineering, whatever, that was what enterprise architecture was turning into. And, like, platform engineering is certainly older than the word itself, right? We all acknowledge that we've probably been engineering platforms for our entire computing lives. But when I would go into these companies, usually the job title of the person who's trying to figure out how to build the thing that they're going to share with everyone was enterprise architect, and maybe it still is.

Nick:

But I see the main flaw in enterprise architecture as too much talking, too much writing, not enough doing. So I hope that platform engineering gets those enterprise architects to start getting their hands dirty again and building with the principles you expect from the teams that are using your platform. You should be able to build and, I mean, move fast, obviously not quite as fast because it's much more risky, but you should be able to iterate and improve the platform, right? And you need actual human beings to do that. I still don't quite understand the value of the person that's just writing a doc and hasn't touched the code in years. I think there's value there. So I'm sorry for all of you that do that for a living, but I can't.

William:

I just don't relate to it. Yeah, I mean, especially if you have a direct impact on the hands that are in the technology, you really need to have done, or to actively work with, those technologies, or else mileage is going to vary greatly.

Eyvonne:

Well, and in order to be able to do that effectively and not have your own grubby little hands in it, you have to be a phenomenal listener and be willing to hear people who have experiences that you don't have. And that's incredibly difficult for all of us, and actually it's just much easier for you to sit down and figure the thing out than to try and suss it out from different sources without that hands-on experience. So, yeah, I mean.

Nick:

So one of my new overlords here, you know, in the container runtimes org at Google, is someone I've known for a while: Gabe Monroy. A lot of people in our space know him. And one of the greatest things about Gabe, and I'm not just kissing his ass, I don't do that... oh, Siri, hush your mouth. One of the greatest things about Gabe is that if he finds out that you're falling behind on something or need help, he's in the pull request. He's in the repo looking at the code, trying to help you out. And we're talking about someone who is, you know, two steps down from Thomas Kurian, right? He's got honestly no business doing it, except that he's got every business doing it, in my opinion. That's really cool. I appreciate that someone at that level still cares and understands what we're dealing with, you know.

Eyvonne:

I love that. That's great. Let's talk about some of the pieces that make up this platform we're talking about. I mean, you are on the Cloud Runtimes team; a little alliteration there is getting me this morning. Let's talk a little bit about the work that you do there. I know one of the questions you had in the notes is really, when we're talking about Kubernetes, how many clusters is too many? Can you ever have enough, or too many? Do you ever have too many Kubernetes clusters?

Nick:

Yeah, yeah. At the root of it all is just managing infrastructure, and Kubernetes... I think I've been in the community for almost a decade now. I just have to say that I need to go back five years and see if I can find that job posting that was out there requiring 10 years of experience when Kubernetes was only five years old. I can now actually apply for that job, so let's go.

Nick:

But yeah, so Kubernetes is like an operating system or an API for infrastructure. It's super handy, but it's not exactly the abstraction you want to give a developer who's writing business logic. They don't want to have to comprehend and reason about all those things. It turns out also that platform teams, to some degree, maybe their life can be easier if they have to understand less about all of the idiosyncrasies of Kubernetes and the VMs underneath it and the storage. You want to know enough to get your job done, and you want to make sure you're not getting overcharged, but maybe you want a managed service, right? Maybe you don't want to have to update the control plane or data plane on your own, and that's what GKE is: a managed service. But then it turns out that I have yet to see a platform team build a platform out of one cluster.

Nick:

Right, ultimately you'll have... either you're going to have environments, you know what I mean, you don't want to ship your dev code in prod, although I think it's pretty cool if you can pull that off, or you have regional constraints or high-availability needs, where you want multiple sets of infrastructure in different regions, either for bringing your application close to your end users or for fault tolerance. But you ultimately end up needing more than one cluster, and Kubernetes itself... its world ends with the cluster it lives in. That's it. It doesn't understand other clusters. We've tried KubeFed, rest in peace. We've tried a lot, and it's failed every time. And so my job is to try again. The things I'm working on are trying to stitch together multiple clusters and treat them as one, but the approach we're taking is not to create a meta-resource that represents both, as much as to create an abstraction that allows you to treat both clusters as the same thing.

Eyvonne:

Okay, right. So the answer is always another layer of abstraction. It's always the answer.

Nick:

But okay, I know y'all are networking people, or at least I know for sure Eyvonne is. One of the hardest, most challenging things with multiple clusters is, how do I get my pod in cluster one to be aware of a pod in cluster two in another region? If you make the networking discoverable across both, then all of a sudden you've increased the availability and the accessibility of that pod.
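
For readers who want to picture the problem Nick is describing, here is a minimal, purely illustrative Python sketch, not GKE's multi-cluster Services feature or any SIG Multicluster API: each cluster only knows its own endpoints, and a cross-cluster view is just the merge of those per-cluster lists. Cluster names, regions, and IPs are made up, and the comment notes the hard parts the sketch skips.

```python
# Conceptual sketch only: merging each cluster's view of a service into one
# cross-cluster view. Cluster names, regions, and IPs are made up.
from dataclasses import dataclass

@dataclass
class Endpoint:
    cluster: str
    region: str
    pod_ip: str

# What each cluster knows about the pods backing its own "checkout" service.
registries = {
    "cluster-one": {"checkout": [Endpoint("cluster-one", "us-east1", "10.0.1.12")]},
    "cluster-two": {"checkout": [Endpoint("cluster-two", "eu-west1", "10.8.3.45")]},
}

def discover(service: str) -> list[Endpoint]:
    """Flatten the per-cluster endpoint lists for one service into a single view."""
    # A real system also has to solve health checks, locality preference, and,
    # hardest of all, pod-IP reachability between the two cluster networks.
    return [ep for cluster in registries.values() for ep in cluster.get(service, [])]

for ep in discover("checkout"):
    print(f"{ep.cluster} ({ep.region}) -> {ep.pod_ip}")
```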

William:

You're bringing back some PTSD here. So, my first introduction to GKE... I just got a blast of information I totally forgot about. We had a professional services engagement with Google, I mean it must have been, hold on, five, six, maybe even seven or eight years ago.

Eyvonne:

We are in a very different place now, I'm just going to say that. But go ahead, tell your story.

William:

Sheesh, it's just crazy it's been that long. But anyway, I'd worked with EKS and AKS, from AWS and Azure respectively, and I can honestly say GKE at that time was quite a different experience, in a lot of positive ways really. So we started with Cloud Identity. You know, we had it tied to our unique DNS namespace for integrating with all the Active Directory stuff, and then we went on and did the folder and project hierarchy, the identity and access management stuff, the good old networking, the Kubernetes, and then all the logging and monitoring, the Stackdriver stuff.

William:

But one thing Google provided at the time, guidance on the Kubernetes stuff, on GKE, was really helpful in how we went back and optimized other Kubernetes environments in other clouds. It was about not overcomplicating what we deployed versus what our needs were at the time. And what I mean by that is we had a multi-tiered structure that balanced risk and efficiency and such. At the very top it was as simple as production and non-production, allowing for very distinct configurations for how we separated risk and security. Then you went one layer down, and layer two was the business domains, the lines-of-business type stuff.

William:

I think the way we had it panned out was that each business domain or line of business got one production and one non-production cluster, and then the third level was the individual namespaces within each cluster for different products, or, dare I say, what we considered microservices. This was a great feather in the cap for developer experience, really, as teams could work independently within their designated namespaces, minimizing blast radius and making security happy. And it was a big win for us, because the way we were doing it elsewhere was that developers sort of had the control, and they were like, okay, each team gets a Kubernetes cluster. And it's like, hey, we have hundreds of teams here, we've got almost a thousand applications. How many Kubernetes clusters can one organization have?

William:

Come on. Doing it this way actually helped us, and it gets expensive otherwise, because it's all compute on the back end at the end of the day. So yeah, I'd say it was a pretty positive experience. I'm sure a lot has changed. And I guess my question is: you're part of the product team that works directly on GKE, but GKE is big, right? So what specifically do you work on with GKE?

Nick:

Yes, GKE, I think we have something like 25 PMs, more or less. I am in core GKE, and I product-manage Fleets and Teams. Fleets is our "you have more than one cluster" solution, and Teams is sort of this abstraction-as-a-service that we provide. So you were talking about how, in this previous life, your developers all had their own namespace. Teams is like namespace aggregation across multiple clusters, as a service, right?

Nick:

So one of the tough problems that customers, or platform teams, tend to have to think through is, how are we going to do tenancy, right? Are we going to do single tenant, single cluster; single tenant, multi-cluster; multi-tenant, multi-cluster? And the answer, at least what I've seen out in the wild, is that there's never one answer. If your organization is big enough, you have a spectrum of tenancy. You've got some business-critical app that's got clusters dedicated to it, and then you've got this junk-drawer cluster that's got all of these other tools and smaller services running in it, and then there's some spectrum of other clusters in between.

Nick:

So what I think is, even if you don't use Teams, one good thing for a platform engineering team to think through is: how do we not bury ourselves in tech debt by being over-opinionated on the tenancy model up front, and how do we build this to be adjustable over time based on the needs of the business, how we want to bin-pack and un-bin-pack things over time? That's what Teams is out to do. It's this logical container of logical namespaces that can then get bound to clusters on demand. So today you want it bound to cluster A, it's there. You want to add cluster B tomorrow, you add it. You want to add more applications to clusters A and B, you bind those teams to the clusters, and you sort of have that abstraction of flexibility.
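
To make the binding model Nick describes a bit more concrete, here is a small hypothetical sketch, not the GKE Fleets/Teams API: a team owns logical namespaces, and binding or unbinding the team to clusters decides where those namespaces should exist, without touching the applications themselves.

```python
# Hypothetical sketch, not the GKE Fleets/Teams API: a team is a bundle of
# logical namespaces, and binding the team to clusters decides placement.
from dataclasses import dataclass, field

@dataclass
class Team:
    name: str
    namespaces: list[str]
    bound_clusters: set[str] = field(default_factory=set)

    def bind(self, cluster: str) -> None:
        self.bound_clusters.add(cluster)

    def unbind(self, cluster: str) -> None:
        self.bound_clusters.discard(cluster)

def desired_namespaces(teams: list[Team]) -> dict[str, set[str]]:
    """Which namespaces each cluster should carry, given the current bindings."""
    placement: dict[str, set[str]] = {}
    for team in teams:
        for cluster in team.bound_clusters:
            placement.setdefault(cluster, set()).update(team.namespaces)
    return placement

checkout = Team("checkout", ["checkout-api", "checkout-workers"])
checkout.bind("cluster-a")             # today: cluster A only
print(desired_namespaces([checkout]))
checkout.bind("cluster-b")             # tomorrow: add cluster B, apps untouched
print(desired_namespaces([checkout]))
```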

William:

Got it. What are the cases? So, if I remember, before we even deployed a single cluster, we had the whole big-enterprise thing of, hey, we want to account for every possible scenario before we hit go and have an MVP or do anything. We just want to account for everything. Enterprise architecture again, got it.

William:

Yes, pretty much, exactly. We had this idea of, okay, well, what happens when we need new clusters? I remember we had a certain team that got stuck on this; we were stuck on it for weeks. It was like, oh, but what happens when we need to create a new cluster? This is going to happen, and that's going to have to happen, what about this, what about that?

William:

And if I remember correctly, at the time it was network exhaustion, pretty much, okay, we ran out of IPs, so we need a new cluster. And then it was the node limit; they were so concerned about the node limit, even though the node limit, even at the time, was ridiculously high. And then I guess the last obvious one would be isolation, like instances needing isolation based on security, based on however your business needs to run. Maybe some clusters are in, you know, HIPAA sorts of environments, and others are just production with no protected data, nothing really important as far as customer impact. But I think those were the three things, if I remember: network stuff, node-limit stuff, and then isolation. Are those still reasonable criteria these days for needing a new cluster, or what does that look like in 2025?

Nick:

I think it depends on where that cluster is being made. I can say proudly, if it's GKE, I do not think the number of nodes is going to be an issue; we just released support for 65,000 nodes last year at KubeCon NA in Utah. So I don't think nodes is the thing. Just to be real, we built support for 65,000 because we have demand. We have customers, like three of them, that are using that many, but it's a very specific use case. I think our job is to build a product for our end users, the platform engineers, that removes the need for them to consider those low-level limits. I would prefer them to focus on the compliance aspect, or the business reason for them to have more than one cluster, rather than an actual physical limitation of one, and over time that's becoming more reasonable. One of the things we've done on GKE, and I don't know how familiar you are with Kubernetes, I think you are, is we took etcd, right, and we're replacing it with Spanner. So we've removed the bottleneck of a database running literally in VMs in a control plane and instead use our own managed service, the one that runs Google Search and Ads and all this stuff, and that greatly increases the requests per second the API server can handle. That requests-per-second limit on the API server was the bottleneck, so we're making a new bottleneck now. I'm not exactly sure where it's going to be yet, but again, the idea is that we're trying to remove those limits from the things you have to consider. Now, there's always other limits, right?

Nick:

IPs. IPAM, especially if you're a legacy company, is just tough. We are rolling forward with IPv6, with the actual IPs allocated to the pods within the cluster, but then they have to NAT everything so it understands the world outside. It's not easy, but the goal is to continue to make the size irrelevant. My personal dream would be that, outside of business and governance rules, you have a cluster per region, right, because they could be so big that it doesn't matter. And I say that now, but there's also the blast radius of change. So we need to do more work to make that real.

William:

Yeah, the IPv6 thing is huge. We'll get into that in a second, though. So continue.

Nick:

Full disclosure, ADHD, my brain's everywhere. So I'm probably the worst interview in the world; it's hard to keep me on track. But one of the things that I think will help, forget what I'm doing, what I would do if I were a platform engineer, regardless of which cloud I'm using, even if I'm on-prem, is to build clusters fungibly. It's the old pets-and-cattle analogy. I want them to be cattle, I want them to be replaceable. I want to add a cluster with the right label, have it come up to the right state, know what its job is, and do work, and I want to be able to remove a cluster when I need to. If you do that, size doesn't matter anymore, because if you need more, you add more, right? Just like you're scaling out clusters the way we used to scale out VMs, or like Kubernetes does under the covers.
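
A tiny sketch of the cattle idea Nick lays out, with hypothetical cluster names and labels: clusters join a pool with labels, work is selected purely by label match, and scaling out is just registering one more interchangeable cluster.

```python
# Illustrative only: clusters as cattle. Names and labels are hypothetical.
clusters = [
    {"name": "gke-us-east1-01", "labels": {"env": "prod", "role": "web"}},
    {"name": "gke-eu-west1-01", "labels": {"env": "prod", "role": "web"}},
    {"name": "gke-us-east1-02", "labels": {"env": "prod", "role": "batch"}},
]

def select(selector: dict[str, str]) -> list[str]:
    """Return the clusters whose labels satisfy the selector, like a label query."""
    return [
        c["name"] for c in clusters
        if all(c["labels"].get(k) == v for k, v in selector.items())
    ]

# Scale out: register one more interchangeable web cluster and it picks up work.
clusters.append({"name": "gke-us-east1-03", "labels": {"env": "prod", "role": "web"}})

print(select({"env": "prod", "role": "web"}))   # three interchangeable web clusters
print(select({"role": "batch"}))                # the batch cluster, selected by role
```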

William:

Awesome. Yeah, and you're absolutely right about the pets-versus-cattle thing; we're still there. I talk to so many folks in the network automation community who are still having challenges with this in large companies, where they have names for individual servers and they're using automation in only a very limited way.

William:

It's the whole, hey, we still have pets in 2025 thing. And I think going up to the cloud does make it easier to do these things; you're not relying on as much iron, as it were. But even with, like you were saying, IPv6 and the NATting scenario, I think what most companies are doing, at least on the consumer side, is dual-stacking. They're doing as much as they can with it. But even when you dual-stack, that introduces complexity; no matter what, it's complicated. You really want, okay, we're just going to go greenfield with IPv6 only, but when everything has an IP address and has to talk with the whole system, one does not simply do that.

Eyvonne:

Well, we are still at a state in the industry where there is always one key critical component somewhere in the data flow, or somewhere in the path, that doesn't support it, or there's some constraint that still keeps us tethered to v4. And we've been talking about this for so long. I guess someday we will get there, but yeah.

Nick:

And then you have the invocations that need to talk over TCP and need to know the Kubernetes service IP address from the outside, and then your whole network diagram just gets really fun. And that's everything, your house of cards.

Eyvonne:

Your house of v6 cards has just fallen.

William:

Speaking of the ADHD you mentioned earlier, which I'm about to channel and go completely off topic: you mentioned Spanner earlier. Spanner is one of Google's gazillion data products. Isn't that the one that can automatically do transactions globally? It has the multi-region automatic replication. Is it that one? It's SQL, right?

Eyvonne:

The way I like to describe Spanner is, first of all, Spanner is the data system that runs all of the underlying metadata for YouTube. So think about all that's required to make that work: it is a globally consistent database with incredible uptime, five nines. It's been a while since I've looked at the data sheets, but it is CAP-theorem-defying because of what it does with the atomic clock. CAP theorem is still real, right? I know Nick is shaking his head at me.

Nick:

It does not. It does not defy CAP theorem. I'm sorry, like, I can't.

Eyvonne:

I say "defying" because it doesn't prove it wrong, but what it does is implement some really clever atomic-clock magic, I don't know, that makes it appear to. CAP theorem is still real. Don't hear me say that CAP theorem is not real.

Nick:

So, like, I mean, it's amazing, I don't want to take away from it. It is amazing, and it's the forethought: to install and create a system that syncs atomic clocks across all of our data centers is what makes it possible. All we're saying with Spanner is that we're guaranteeing atomicity at distance. That said, network partitions are still real, right? But the risk of an actual outage is almost imperceptible.

Nick:

So we kind of say it defeats CAP theorem, but the idea here is: it's a relational database system, it has ACID semantics, and you commit a change. We're just saying we're going to make sure that change is committed and timestamped across all of the replicas, across all of the regions that are in that Spanner shape. And there is a primary, this is why I say it's not really defeating it, there is one primary at any given point in time, but we react and adjust and you almost don't feel it, because it's all on our network, it's super fast, and we guarantee the consistency, or at least guarantee the ordering.

Nick:

So in your app code you then have to deal with conflicts, right? That's the part people don't really talk about that often, but you still have to handle it. We can say, hey, there's a conflict, and then you decide whether you want the latest or whichever version of that transaction commit. But it's amazing.
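
The conflict handling Nick mentions is essentially the classic optimistic-concurrency pattern. Here is a generic sketch using a hypothetical `db` client object rather than the real Spanner client library; the shape of the retry loop is the point, not the API names.

```python
# Generic optimistic-concurrency sketch. `db` is a hypothetical client object,
# not the real Spanner client library; the shape of the retry loop is the point.
import time

class ConflictError(Exception):
    """Raised by the hypothetical client when a conditional write loses a race."""

def update_balance(db, account_id: str, delta: int, retries: int = 3) -> None:
    for attempt in range(retries):
        row = db.read(account_id)                  # e.g. {"balance": 100, "version": 7}
        try:
            # The write only commits if the version we read is still current.
            db.write(account_id,
                     balance=row["balance"] + delta,
                     expected_version=row["version"])
            return
        except ConflictError:
            # Application policy goes here: back off, re-read, retry, which
            # amounts to a "latest wins" resolution.
            time.sleep(0.05 * (attempt + 1))
    raise RuntimeError("gave up after repeated write conflicts")
```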

William:

Yeah, it is amazing. I mean, when you think about data, every cloud and every thing is different. You've got availability zones that you have to worry about, and usually those are going to be synchronous replication across data sources. But the moment you go multi-region, your whole infrastructure and application design is thinking about asynchronous replication, how that's going to work, active hosting across multiple regions or active-standby. There are so many different things that go into the architecture, and data gravity is usually the challenge there, because most applications are at least trying to go as stateless as possible at this point, even though a lot of them do have a lot of state baked in inadvertently. But yeah, to take the complexity out of the database portion of that whole process seems like a really awesome win.

Nick:

Yeah, I mean, just as a developer or a platform engineer, the fact that I can just provision this database system and not have to program how and when it fails over. I just pick a shape, right, and I understand the trade-offs for that shape with cost and availability, and just go. As opposed to back in the day... oh man, I have some more stories from my days at Microsoft. These folks would set up SQL databases in two different regions and then use third-party technology to sync them, or log shipping, or any of that stuff.

Eyvonne:

It was brutal. Yeah, been there, done that. And, you know, you have to need that kind of scale and consistency to be able to use it. It's a premium service, but it's also incredibly valuable, and it takes a huge amount of engineering overhead and effort off your engineering design teams. You just don't have to think about it, and that's not something we've been able to do for a very long time in this industry.

Nick:

Y'all, so we have very brilliant engineers at Google, as there are everywhere, right? Some of them are young, fresh out of college, and they've only worked at Google for a couple of years, so their only real-life experience with having to deal with consistency is Spanner. And sometimes I'm on calls with them like, do you know how good you have it? Do you know?

Eyvonne:

You just have a "bless your heart" moment, don't you?

Nick:

Yes, not to do the "bless your little heart" thing; my mother-in-law's super Southern, so I understand what that means. Yeah, it's like, oh, isn't that...

Eyvonne:

Yeah, but then you're starting that "I walked to school uphill in the snow, both directions" kind of conversation, so you're becoming that guy. But in this analogy, because of the database consistency, like 80 to 90 percent of the other folks in their position, working for other companies, are still walking uphill in the snow. That's right, that's right.

William:

Yeah. So, I have a lot of questions. I knew the answers to most of these back when I was doing this, but things have changed. Typically, is it still the common thing, for a large organization that's trying to modernize, or even a medium-sized organization with lots of applications, to profile things? Okay, we have this set of applications that looks like a good fit to modernize with GKE, then we have some sort of process we go through where the application is evaluated, it's broken apart, the dependency mapping, all these things, okay, it's a good fit, now we're going to migrate it. Is that still kind of how the process works?

Nick:

Yes, I think there's still a good bit of customers doing that.

Nick:

I do think there's sort of been, honestly, a little pause on "cloud is the best thing" recently. So a lot of people who are still in data centers feel empowered again to say, no, we shouldn't change anything. So I don't know if there's as much of an effort across all these enterprise customers to migrate all the things. But yeah, there's still a triage. If you're a big company and you're just now starting, which I think most have already started, you would pick the low-hanging stateless applications or something to experiment with. Then you have to go down the stack of complexity and work out how and when you want to migrate the more difficult applications. And I just want to put this on wax: I'm still here to say that if that application runs on .NET that's not supported on Linux, leave it on a VM. For the love of God, don't put it in a container.

Eyvonne:

That's what I was going to say. I think several years ago there was a movement to just modernize all the things. Everything should always run in a container, that's the way of the future, this is the way everything's going to be. But the more organizations have gotten in...

Eyvonne:

First of all, I think some of them have learned how much they don't understand about their applications and how they run. And so if the application is natively meant to run in a container, if it's code that you're writing, that you have access to, that's going to be long-term strategically valuable for your business, it makes a lot of sense to replatform that onto Kubernetes. There are some applications, though, that just do what you need them to do. There's not a whole lot of strategic value in modernizing and updating that particular application, and in those cases, I think what Nick is saying is: leave that thing on a VM, because you're always doing an opportunity-cost calculation of what makes sense and where you need to put your effort.

Eyvonne:

And, you know, if you think back to the very early days, there was a set of folks who just believed that serverless containers were going to be the only thing people used in the future. The world's just more complicated than that. So the thing I would recommend for enterprise folks, more than anything, is: know your application stack and understand what makes sense for your specific application stack, because as much as we have opinions, you're going to know your stuff better than anybody else. Anybody else, including the Deloittes and the EYs and all of the consultants. You need to understand that, and then look for expertise, knowing your applications and your requirements, to understand where it best should live. That would be my guidance.

Nick:

But, Eyvonne, I don't want to know these things, I just want a magic button.

William:

You must be an executive, then. You're a new fee now.

Nick:

That's great. You put it better than I could. I think I've been disconnected a little bit from it, frankly, as I'm now so hyper-focused on GKE and stuff. I had a more eloquent answer stacked in my brain when I was dealing with those things every single day for five, six years. I will say, just taking it back to platform engineering for a second, I think another reason for the existence of these homegrown platforms is that they're a direct response to the fact that no company can build a platform that absorbs a significant amount of the workloads out there. The best we've ever had was Heroku, and there was a very, very specific style of application you had to run to use Heroku. That was the best, right? I still get happy when I think about heroku push, or whatever, cf push, but that experience was second to none.

Nick:

Yeah, but you can't do that for a legacy application. You certainly can't do that for, you know, the .NET 3.2, whatever, Web Forms thing you have set up. So the problem with the PaaS market, the companies out there trying to build PaaSes, and some of them are my buddies, like the ones building Wasm as a service, right, is that it's only going to handle greenfield use cases most of the time. And I don't have a statistic anymore, I would love to know what it is, but I would go out on a limb and say probably 70% of the applications running out there are not greenfield. They're old, they're making money, and probably 30% of them were written in COBOL or something, I don't know.

Nick:

Kubernetes, though, what it did is give you a higher lowest-level abstraction, I guess, right? It can't do everything, but it can do a lot more than Heroku could, so you can take this common infrastructure and build different interfaces on top of it. People are basically building their own Heroku-style PaaSes on top of Kubernetes so the new stuff can have the same security model, governance model, and service discoverability as the old stuff, and that's the thing Kubernetes brings that I think makes it so attractive. You can kind of put a lot here. Can't put everything here; don't put Windows there.

William:

Love the Windows comments. Yeah, I don't even want to... it's a different... wait, wait, you have a war story about Windows containers?

Nick:

Those are always fun. I have a.

William:

I've had a few fights. We went through a period, it was probably about four months, of just... there are just some things you don't do if we're technologists and we have experience. It's kind of like stretching Layer 2 yourself over homegrown services, over a DCI, when you haven't set up the infrastructure properly across both ends and haven't considered the distance between the data centers and all these things, and, you know, the shared-fate thing becomes pretty daggone terrible. It's like, hey, we know that this is just not a good idea at this point.

Eyvonne:

You have shared fate and it's really bad for everybody, is how that goes?

William:

As in, the internet's down. And like, okay, we actually put one sensor on this side of the DCI for the firewall and the other sensor on the other side of the DCI, in another data center, because we were trying for high availability like a hyperscaler. Sorry, how did we not think?

Eyvonne:

This is all trauma.

Nick:

Real quick, this is that trauma. For the user out there like me: what is DCI? I don't know. Data center interconnect, probably? Yeah, this is just your working jargon. Okay, cool, sorry.

William:

All right, that makes sense now. Yeah, basically connecting two data centers and saying, hey, for high availability, we're going to put one thing in one data center and one thing in another and cross our fingers, until the internet goes down for the fifth time, and then stretch a database across it.

Nick:

Right yeah.

Eyvonne:

Yeah, we're going to manage state across it and use Layer 2 and all that fun stuff. Yes. But that was William's analogy; you kind of went down the DCI rabbit hole there. But talking about Windows is what started that.

Nick:

So, Windows. Here's the thing. I did work at Microsoft for five years, and for a minute there I was tasked with trying to figure out how to help customers take their Windows workloads and put them on, not even Kubernetes, it was Docker, Docker Swarm. But the dirty truth about most of the Windows containerization, and I'm talking about older Windows applications, not newer .NET Core and above that can run on either Windows or Linux, that's a whole other conversation, but the older stuff, the stuff that needs IIS, right: the root process of the container image is IIS.

Nick:

So now, where I had an IIS farm before, let's say it was five servers and I was running 10 apps on it, each server had one instance of IIS running with those 10 apps spread across it, sort of like containerization and bin packing. Now I have those same five servers in a Kubernetes cluster running Windows Server, and I have 10 instances of IIS. So I'm doubling the processing requirement, because the root of the container, the startup process for most of these apps, was IIS. So my question was always, what are you getting out of this? It's not more efficient. Or do you just want to be cool? Because you're already not going to be.

Eyvonne:

But didn't we all get into this line of work to be cool?

William:

I'll give Windows one notch. The only positive experience I can honestly say I've ever had on a Windows machine was when they released Windows Subsystem for Linux 2. That made the experience of running containers locally... the performance was actually good. If I'm doing it on my Mac with ARM, the performance of running Docker, because of the additional abstraction in the kernel, is not good. It's not fast. Doing it on Windows Subsystem for Linux 2 is actually very fast. I don't know how the mechanics of all that work, but the experience is really good, although I don't use Windows anymore. I think it's like Hyper-V provisioning a Linux distro, and then they built a really nice integration on top. It's great, to be honest. And a lot of customers, a lot of our end users out there, don't have the option to not use Windows.

Nick:

Right, it's like a corporate mandate because of Active Directory and all these things. So good for them that they have WSL. I'm a little bit privileged, actually a lot privileged, because just look at me, but I use gLinux now. I actually have a laptop with Google's Linux distro installed on it, and that's what I develop on, and it's amazing. I don't have to do any weird shenanigans; I can just basically run Linux processes, which is what containers are.

William:

I'm so jealous. If I could choose between a Windows or a Linux machine, I would run straight Debian for everything. I've been using Debian since, like, the late 90s. It's my OS of choice. I love it. I went to Ubuntu for a hot second. I actually started off on Slackware originally and landed in Debian at some point.

Eyvonne:

Slackware was my first distribution too, yeah.

William:

There you go.

Eyvonne:

Back in the day.

William:

After using Slackware for a little bit, I was like, do I really like this Linux stuff? And then I hit Debian, and Debian's just amazing. If I could run Debian on a machine for work, I would be a happy camper. Unfortunately I can't, and never have been able to, so it's back to the tricks. But at least now you can spin up an ephemeral machine in the cloud, run something, connect to it pretty easily, and do what you need to do. But it's nice to have that local dev environment and that local experience with Linux.

Nick:

Yeah, hey, so my ADHD brain is kicking in. I feel like, honestly, I could talk to y'all for another three hours, but I'm sure we don't have that much time. There is one plug I'd like to give, yeah. We talked about a lot of the great stuff I'm working on in GKE, right, but what I'm really passionate about is making sure the things we're building for multi-cluster and GKE land upstream in open source. A lot of the work that I did last year, and when I say me, I mean I'm a product manager, I add value, little to no actual work, was helping the team deliver some tools in SIG Multicluster, which is part of upstream Kubernetes, right.

Nick:

And the purview of SIG Multicluster is to start to help build the primitives for multi-cluster toolchains across upstream Kubernetes, which we then use in the managed Kubernetes world. So the one we shipped right around Utah, at KubeCon, in that Kubernetes lifecycle version, was cluster inventory, right. It turns out all these tools that a lot of customers are using to manage multiple clusters, Argo CD, Flux, Fleets, are basically just lists of clusters with metadata that allow you to manage all of them, right? So if you want to use Argo CD and Fleets or Kyverno or whatever tools you want, you now have to manage three or four or five different cluster lists. Every time you add or remove a cluster, you have to write glue code to make sure that update is synced across all those lists. So what we decided to do up... upstream, now you have me thinking about terminals. Upstream, yes, is to, you know, make one list to rule them all.

Nick:

But at least this list is sort of neutral, right? You can make changes to this list, and my stuff with Fleets and Teams will respect those changes. So if you make a change on a member of the list, it'll actually reflect into whatever toolchain you're using. With Fleets and Teams, I'm working with the Argo CD upstream team to build plugins so that it works there too. We've got multi-cluster queueing that's also going to key off this list.
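
A rough sketch of the "one list" idea Nick is describing: a single cluster inventory with per-tool sync hooks registered against it, so an add or remove propagates once instead of through hand-written glue for each tool. The adapter functions here are hypothetical stand-ins, not real Argo CD or Flux APIs; real adapters would update each tool's own configuration.

```python
# Sketch of the "one list to rule them all" idea: a single cluster inventory
# with per-tool sync hooks. The adapters are hypothetical stand-ins, not real
# Argo CD or Flux APIs.
from typing import Callable

inventory: set[str] = {"gke-us-east1-prod", "gke-eu-west1-prod"}
subscribers: list[Callable[[set[str]], None]] = []

def subscribe(sync_fn: Callable[[set[str]], None]) -> None:
    """Register a tool-specific adapter that mirrors the inventory."""
    subscribers.append(sync_fn)

def add_cluster(name: str) -> None:
    inventory.add(name)
    for sync in subscribers:   # one change, every tool notified, no glue scripts
        sync(inventory)

def remove_cluster(name: str) -> None:
    inventory.discard(name)
    for sync in subscribers:
        sync(inventory)

subscribe(lambda clusters: print("gitops tool sees:", sorted(clusters)))
subscribe(lambda clusters: print("policy tool sees:", sorted(clusters)))

add_cluster("gke-asia-east1-prod")
remove_cluster("gke-eu-west1-prod")
```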

Nick:

So the point here is that we have this centralized list that's hopefully going to make the world easier for platform engineers, so they don't have to manage five, six, seven, eight, nine, ten lists or whatever. So the plug is: if you're interested, we would love you to join. Every SIG has a weekly or biweekly meeting, right? We routinely get engineers from big companies like Google and Microsoft and VMware, and we're all working together. But we would love end users to come and join these meetings to validate what the heck we're building. So please participate. Maybe there are show notes or something I can give you all a link to. And it's open, right? Anybody can join. That's the beauty of Kubernetes and these SIGs; anybody can.

William:

By SIG you mean special interest group, with the CNCF, right? Yeah, awesome. We will definitely link it, because I'm sure, you said it was SIG Multicluster, right? Okay, I'll find that group and I'll link everything in the show notes. That's a great idea. You heard it: the more involvement, the better, and the more potential

Eyvonne:

it has to be just amazing with your input out there. And one of the things that I learned, I'll say this and then we do need to wrap up, but one of the things I learned after I was an enterprise network engineer and architect is that once I crossed the chasm from the customer to the vendor side, I learned how important smart customer feedback is, and I realized I had a lot more value to my vendors than I understood at the time, because I was hands-on-keyboard, using their stuff all day, every day. I knew where the points of friction were, and being able to show up and have meaningful conversations with them about where I had problems and what would improve the product was incredibly valuable. So don't sit there and say, I'm at maybe a Fortune 500 enterprise, a medium-sized enterprise, and we do a few cool things but we're not super huge. Actually, you're in the sweet spot. So show up, make a contribution, talk about your experience, because that is incredibly valuable.

William:

Awesome. Well, so for all the kind folks out there, where can they find you, Nick? Aside from your dancing videos on TikTok, where else online are you these days?

Nick:

Dude, you've seen those? I thought... okay, yeah, so I'm still on X, Twitter. I hate to call it X; I still call it Twitter. I honestly would love for Twitter to be no more. I'm also on Bluesky, LinkedIn, and SIG Multicluster, so if you actually want to talk to me in real life as a human, join those things. But seriously, if you have questions, or you're customers out there and users of GKE, I am happy to meet with anybody to take in that feedback. So just find me on any of those social platforms and I shall respond. Also, I am way opinionated and loud on socials. It comes with a warning label, I guess.

Eyvonne:

Enter with caution, you know, proceed with caution. You've done great on this talk, Nick. It's been wonderful.

William:

Yeah, really enjoyed it. We went over an hour; we're over an hour right now. It's crazy, it's good stuff. You know it's a good conversation when you don't realize how long it went.

Nick:

Thank you very much. You'll probably cut this part out, but I feel like you should close us out with something about Lord of the Rings and YAML. Just have the bookend.

William:

I should. I need to come up with something clever. I actually think, though, that's a good idea.

Nick:

You're always thinking about the good guys. What about the bad guys? What is Sauron in this analogy? Is he the underlying API that the YAML is trying to organize? Is he a DDoS?

William:

What is he? It could be both, yeah. And he had many agents in many places, that's the thing. You throw in some VPs and some of the bad parts of the bureaucracy of the organization too.

Nick:

Imagine he's just a SQL database that has nodes in 10 different regions around the world. That's what he is. There's never actually a commit because they can never agree.

William:

And then the world was in darkness.

Eyvonne:

It's that script that syncs all these free Microsoft SQL databases that exist. Somebody wrote a homegrown script to sync all those so they wouldn't have to pay for SQL and could use the free version, and that happens in the background.

William:

So that's the one sync to rule them all. One sync, there we go.
