The Cloud Gambit

Mobile Development Tales and Thrills with Hanson Ho

William Collins Episode 37


What happens when your mobile app needs to perform flawlessly across thousands of different devices? Meet Hanson Ho - Android Architect at Embrace. In this episode, Hanson shares battle-tested strategies from his experience at Twitter and beyond. Diving deep into real-world challenges, Hanson reveals how mobile observability has evolved from basic crash reporting to sophisticated performance measurement. Learn why traditional monitoring falls short in mobile environments, how to measure what truly matters for user experience, and why the industry is rallying around OpenTelemetry as a standard.



Hanson:

At Twitter, we had to build an entire protocol based on the infrastructure that we had to measure these client performance things. We called it PCT, Performance Client Tracing. Very similar to what I'm working on right now, which is OpenTelemetry's tracing API: creating spans. With that data, and the ability on the SDK to measure this, you can start looking at your different workflows and how long they took, and you can see a wide range depending on devices, depending on what you're actually trying to do.

William:

Coming to you from the colossal Cloud Gambit studio. This is your host, William. Today on the pod we have Hanson Ho, and you're in Western Canada, right? I'm in Vancouver, Canada. Awesome. Have you ever heard of a sport called...

Hanson:

Uh, hockey? Is that something I've ever heard of? I might have, you know, seen it once or twice in my life. You know, it's pretty obscure, but, uh, yeah, yeah, a little bit.

William:

Are you a Vancouver fan?

Hanson:

By any chance? I am, of course, naturally. A sad, sad Canucks fan, the franchise with, oh, 50-plus years of not winning the Stanley Cup, despite getting close twice in my lifetime. But, you know, what is sports if not the elation of the victory and the struggle to get there? It's the journey, not the destination, that's what I tell myself, right? So if we win it, it'll just be less good, because the anticipation will be less. That's what we all say to cope.

William:

Vancouver's got such a good young core now. Quinn Hughes is an awesome captain, an awesome defenseman, and that Oilers series last year was crazy.

Hanson:

We played out of our skis, despite not having our starting goaltender. You know, Thatcher Demko is excellent. We threw in a guy, Silovs, who had, oh God, 10, 15 games in the regular season, ever. We did pretty okay, despite playing, you know, the best hockey player ever in terms of physical ability and skill.

Hanson:

We did a good account of ourselves.

William:

Yeah, he was a skating highlight reel in that series. Some of the stuff he was doing was wild. I could actually talk about this all day, so I'd better stop myself now before we get too lost. But yeah, really glad to have you on. You have such an interesting background and work in such an interesting field. It's a field I've actually talked to a few folks on the pod about, OpenTelemetry, and you've worked for some pretty big companies that folks out there may have heard of: Salesforce, Twitter, among others. Do you want to start us off and just kind of go through your background and how you got into mobile app development specifically? It doesn't have to be super detailed, just a brief overview.

Hanson:

Sure. I think 2015 is when I first got into mobile development. At that time, Twitter was hiring furiously, especially to improve its services on the mobile clients, especially for folks in emerging markets with devices that are of low quality and networks that are, shall we say, iffy at best. The team I was working on was tasked with improving the experience of folks like that. So that was the first time I started working on Android, and the first time I started working on anything that involved a lot of networking and device-quality concerns. I'd used an iPhone previously, but then hopped onto the Android ecosystem and found it fascinating, just because of the diversity. The problems we had to solve were a lot more challenging, just because of the number of manufacturers of devices and the quality of some of those devices. I mean, they're meant to be affordable, and you have to do certain things to make devices affordable, especially back in 2015. So I learned how to optimize performance and do network experimentation at lower levels. What are the impacts if we change the number of concurrent network requests to four, if we change the window size? Things like that, which mobile software developers aren't naturally inclined to do: tweak things under the hood and then see what the differences are.

Hanson:

And from that point on, I found a passion for stability and performance on Android, which let me become somewhat of an expert within the company on performance and things like that. One thing I got especially interested in is how you measure performance and have it relate to actual impact on users. So, you know, P50 changing: what does that actually mean to users? To develop a more nuanced understanding of performance, we had to experiment and gather data. And then, you know, things happened at Twitter, I became a free agent, and I hooked up with Embrace, whose purpose is to help folks build better mobile apps. Adopting OpenTelemetry as part of the SDK and using it as a vendor-neutral, agnostic transport and format for telemetry is an excellent fit. Where I am now is trying to help improve the SDK, as well as OpenTelemetry and mobile telemetry in general, and standardize it.

Hanson:

It's an area where, when folks use an application these days, it tends to be on a mobile phone, whether it's an app or a website, and unfortunately, mobile devices aren't the most trustworthy.

Hanson:

You know, the things I talked about previously, with emerging markets and Android devices, still apply. Not everybody's got a fast phone, and now the problem is just a lot worse, because we have a lot more Android versions and we have higher expectations as well. So how do we ensure that performance is up to speed? Well, you've got to measure, and you've got to do it in a way that's aware of the nuances of the mobile environment. We don't know when things happen unless the device tells us, and sometimes devices just don't tell us, because they've gone offline, or because the app process gets killed by the OS because something in the background is using too much memory. So it becomes a really fascinating challenge to not only capture and send the data, but do it as economically as possible. So, yeah, I take it as a never-ending improvement of how we can best do this.

William:

Yeah, it's funny you bring up the Android ecosystem. I was talking to an Apple mobile developer, not on the pod, just somebody that I know. He has an iPhone and I have a Google Pixel, and he spares no moment to give me flack for having an Android. We could be going out to get something to eat, it doesn't matter. So I give it back to him a little bit.

William:

But he was sort of venting about what you said. He's like, I could never work in the most complex ecosystem in the world, like Android. With Apple, we just have this many OS versions, we have this many phone models, and that's it. Everything's locked down, it's very consistent. But, you know, somebody could fork Android and do this and that, and then there's, like, a gazillion different devices everywhere. And from his view: how would you even begin if you were going to build a new app? How do you build a strategy, and think through and profile how you're going to go through the app development process, how you're going to begin putting things together? What do you think about that?

Hanson:

Strategically, from the beginning, thinking through the Android way of doing things: the thing with Android is, the platform itself has improved quite a bit since I started working on it. The OS itself, the API on top of it, and also the Android platform and SDKs that Google offers in terms of what you need to do just to have an app up and running. So, higher-level frameworks for UI, like Compose. You have various methodologies and patterns to create apps, so it isn't as free-for-all as before. You have libraries that are either official, supported by Google, or de facto official, like OkHttp, where almost everyone uses it. So the fragmentation has gone down and the standardization has gone up, and the things you need to do to make your app good and usable for most cases are a lot easier than they were nine years ago. The problem is that the long tail is quite long. We have mobile devices that were released nine, ten years ago that are still in use, running Android OSs that are still supported. So Embrace supports back to Android 5, which was released,

Hanson:

I want to say, oh God, I'll get this wrong, 2014, 2015, something like that, maybe 2016. No, probably 2015 at the earliest. But you have APIs that are different. You have built-in libraries that behave differently. TLS 1.3 isn't even natively supported on older Android versions, so the app has to install additional libraries in order to get the proper networking code to talk to modern servers that have rejected insecure protocols. So you have these degenerate use cases that become difficult to support. And if you're just starting off, don't worry about all that. Just support a higher minimum version of Android: your APIs are a lot smoother, and you're dropping 20% of your total addressable market, but you can get started a lot easier. So my suggestion is, don't make it work for everybody first. Just make it work for most people, and Android does a pretty good job of letting you work for most people.

Hanson:

Now, big companies, companies with large user bases, especially user bases that are perhaps not super savvy, say a regular big-box retailer that caters to folks of all ages and all levels of technical astuteness: well, their customers might have a phone that a grandkid gave them eight years ago, and they still want to use your app. And if you don't support that older version, they're not going to be able to use it. So it is important to be aware, after you've got things working for 80%, of improving the experience for the other 20% that's out there. And the faster you make your app, the more of those older OSs and worse-performing devices become usable to you. So you effectively increase your total addressable market: those older phones become able to use your app.

Hanson:

So I would progressively add support as you can. Mobile is, unfortunately, an unending pit of problems and crashes and interesting bugs. So you can't ever fix everything. You just have to prioritize and triage. It's not about not dropping any balls; it's about dropping the smallest, least important balls.

William:

I love that. That's so good, and it's so true. Good, honest answer. Well, I don't want to get ahead of myself. I was about to jump into telemetry and optimization, but going back, just for the audience: you're an Android Architect. What language do software engineers typically use when building a mobile app for Android? Is it like C++ or something else? What's the standard?

Hanson:

Kotlin, in 2024, is the standard. There are mobile developers out there who do Android who don't even know Java, which would have been unthinkable ten years ago. Impossible, actually, ten years ago. But these days it's Kotlin. For iOS folks, maybe they're familiar with Swift versus Objective-C; it's similar to the Kotlin-versus-Java difference. You can do native libraries in C++ as well. A lot of graphics, a lot of heavy, intensive work is typically done in the native layer like that. But that's not most of the experience of Android developers. It's usually Kotlin, using Compose to write their apps.

William:

Gotcha, awesome. And so, okay, you begin to build an application, and, kind of like what you were talking about earlier, you can't really understand, quantify, or calculate much unless you measure it. And when it comes to measuring things, that means testing. Testing is the hallmark of amazing development in general. Can you speak to some of the thought processes and practices you might use as you begin to think through, profile, and optimize? What are you thinking about when you're writing tests toward the beginning, and how are you thinking about measuring things, just from a first-principles point of view?

Hanson:

So first of all, you have to know your app works. As useful as unit tests and integration tests are, they capture only the scenarios you can test for, on the devices you can test for, for the cases you think are most important. Still, having automated unit tests and integration tests is table stakes. If you don't have that, your app is not going to stay high quality forever, even if you have the best developers in the world. So the first thing is to make sure the things you check in don't cause a regression, and on a platform with as many different APIs as Android, regressions are easy to introduce, so locking that down is important. Testing for edge cases, clearly, that's important too. But beyond that, having telemetry about what your apps are doing in production is equally important.

Hanson:

So traditionally, for mobile apps, people track things like crashes. It's always crashes, crashes, crashes. What's my crash rate? What's my crash-free rate? All that stuff. And the reason folks do that is because that's the easiest thing to track. Crashes are a discrete event. You have SDKs like Crashlytics that will basically do this for you: capture what's bad, tell you when bad things happen, and then you have a sorted list of, oh, these are the bad things that happened, and you burn through that list. For a long time, it was about how we reduce the number of crashes and distinct instances of crashes. So people have rollout procedures: dogfood, beta, and then a percentage rollout, 1% for 24 hours, ramping up to 100% if nothing bad happens, at each stage checking to see if there are new crashes, or if old crashes have gone up because something has changed. And when that happens, you search for the right team to own it, and then you fix it, patch it, and re-release. Typically, that's the workflow of mobile stability, and it works well for the most part, until you realize that not all bad things happening in mobile apps result in a crash. Sometimes things could just be slow. Sometimes things just don't happen the way you think they would in terms of the amount of time they take.

Hanson:

So we've developed other metrics for that. On Android, there's something called ANRs, application not responding, which is basically screen freezes. iOS has something similar. We also have jank measurements, you know, frame drops. Basically, when you scroll a list and see janky UI with frames dropping, that's your main thread being too busy to render new frames in time. People track that because that's what Google gives you; you can go to the Google Play console and see some of this information. App startup is also something folks track. But bad performance can happen at any other stage, and how do you actually detect that?

Hanson:

Well, this is where tracing comes in, creating spans. I think backend developers are very familiar with the notion of spans and traces. You break a workflow down into, effectively, a tree of workflows. The top one represents the entire workflow, and then you have child spans that represent sub-workflows. If you have a network request, for instance, there's the construction of the request, the sending of the request, waiting for the server to come back with the data, deserialization, and then persisting the response in the format you want.
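The workflow-as-a-tree idea can be sketched with a minimal, hypothetical span type. This is not the actual OpenTelemetry API (which records timestamps, attributes, status, and context propagation); it just illustrates how one parent workflow decomposes into timed child spans, using the network-request example above.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a tracing span: a named workflow with a
// duration and child sub-workflows. Hypothetical, for illustration only.
class Span {
    final String name;
    final long durationMs;
    final List<Span> children = new ArrayList<>();

    Span(String name, long durationMs) {
        this.name = name;
        this.durationMs = durationMs;
    }

    // Create a child span representing a sub-workflow.
    Span child(String name, long durationMs) {
        Span c = new Span(name, durationMs);
        children.add(c);
        return c;
    }

    // Render the tree, one line per span, indented by depth.
    void print(String indent) {
        System.out.println(indent + name + " (" + durationMs + " ms)");
        for (Span c : children) c.print(indent + "  ");
    }
}

public class SpanTreeSketch {
    public static void main(String[] args) {
        // The network-request workflow broken down as described above.
        Span root = new Span("network_request", 480);
        root.child("build_request", 5);
        root.child("send_and_wait", 400);
        root.child("deserialize_response", 45);
        root.child("persist_response", 30);
        root.print("");
    }
}
```

With the real tracing API, the same shape falls out of starting a root span and opening child spans around each step; the span names and durations here are invented.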

Hanson:

You break that workflow down, and there is no really good way on either platform, iOS or Android, to get an SDK, and the data exported, in a way you can actually slice and dice to measure arbitrary performance. Say you click a button and want to know how long it takes for an image to load: there is no easy way to do it. So at Twitter, we had to build an entire protocol based on the infrastructure that we had to measure these client performance things. We called it PCT, Performance Client Tracing. Very similar to what I'm working on right now, which is OpenTelemetry's tracing API, creating spans.

Hanson:

With that data, and the ability on the SDK to measure this, you can start looking at your different workflows and how long they actually took, and you can see a wide range of times depending on devices, depending on what you're actually trying to do. You'd be surprised. Well, you probably wouldn't be surprised, but app startup can differ by 10x depending on what device you're using. If you're using a newer phone, having app startup take more than a second might seem ridiculous. If you're using a Moto X from 2015, it's de rigueur for it to take eight and a half seconds. People are used to staring at their phone and waiting for it to start up, and it's okay, because ultimately they're used to the experience. So, going back to what you were talking about before: what do we measure in production to know that things are actually working properly for folks?

Hanson:

Duration is great, fast and slow is great, but whether it succeeded is the most important thing. Because typically, when you track telemetry for performance, it's: well, when I succeed, I'll record how long it took. Well, what if the user gave up and just closed the app? What if the app crashed in the middle? Your server doesn't know that. It doesn't even know that your client initiated something. So before you track duration, track the fact that something has happened. That's the most important thing. Did the thing the user was trying to do actually happen? Then you can build performance measurement on top of that afterwards.
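The "track that it happened before you track how long it took" idea can be sketched as a tiny counter: count every workflow start, then record the outcome separately, so abandoned or crashed attempts aren't silently missing from the data. The class and method names here are hypothetical, not any real SDK's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker: starts and successful completions are counted
// independently, so "user gave up" and "app crashed mid-flight" show up
// as a gap between the two, instead of vanishing from duration-only data.
public class WorkflowTracker {
    public enum Outcome { SUCCESS, FAILURE, ABANDONED }

    private final Map<String, Integer> started = new HashMap<>();
    private final Map<String, Integer> succeeded = new HashMap<>();

    public void onStart(String workflow) {
        started.merge(workflow, 1, Integer::sum);
    }

    public void onEnd(String workflow, Outcome outcome, long durationMs) {
        if (outcome == Outcome.SUCCESS) {
            succeeded.merge(workflow, 1, Integer::sum);
            // A real tracker would also record durationMs here; only
            // successful attempts have a meaningful duration to report.
        }
    }

    // Fraction of started workflows that actually finished successfully.
    // If you only logged durations on success, this number would be invisible.
    public double successRate(String workflow) {
        int s = started.getOrDefault(workflow, 0);
        if (s == 0) return 0.0;
        return succeeded.getOrDefault(workflow, 0) / (double) s;
    }
}
```

If four image loads start and only three succeed, the success rate is 0.75, a signal that a duration-only metric would never surface.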

William:

Such a good answer, and that just made me think about a lot of other things. I'm trying to think of how to frame this question. So, I work for a software company that builds software on the cloud providers to connect different clouds, connect on-premises sites, things like that. We're microserviced out, we've got all these backend APIs; we're really modern as far as software development is concerned. But one of the things we've recently been working on, and by the time this goes out we will have already released it, is a zero-trust network access product, or ZTNA, and one of the things that was really important to us,

William:

for obvious reasons it would probably take too long to go deep into, was basically separating the control plane from data-plane-type mechanisms, where the actual traffic is passing and doing things. So I guess what I'm getting at is: with technology today, there are so many different ways to do things, and one of the things that's really important with gathering telemetry, anywhere networking is a dependency, is doing the right thing in the right place. Is that something you're thinking through a lot? Okay, are we going to do this at the endpoint? How are we going to do it when we're going to ship stuff back? All those different things. What is the challenge there?

Hanson:

So, coming from a mobile developer background into observability and OpenTelemetry in general, the one big difference is the operating environment. On the backend, when you have telemetry being recorded, you're fairly sure you're not going to lose data, or at least not lose it in a significant way without knowing it. You also know your execution environment is fairly well controlled. You're not going to have clusters that suddenly stop providing enough RAM, or that suddenly say, I'm not going to schedule your thread now because I'm busy doing something else, because GC is happening and I need two seconds to pause everything. Switching to a mobile environment, where execution is at the whims of the OS, the whims of the user, the whims of even the device itself, proves challenging. Not only do you have to know that everything around you, the APIs, what you're saving things to, what you're sending things to, is hostile and may fail at any time. You also have to know that when it fails, you have to retry, and not just blindly retry, because there are circumstances where retrying is not good. If your device is on an airplane, why retry your network connection when you know you're not going to have it? So it's building out these assurances at each gateway: of capture, of persistence, of sending, of validating that you've sent data, and having knowledge that things are completed before the next step, so the handoff is warm and not just dropped off with a, yeah, see ya, I'm done. So having knowledge of your own stability, I guess, is important, and doing that with as little effect on the app as possible is equally important.
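The "don't blindly retry" point can be sketched as a small helper: back off exponentially between attempts, and skip attempts entirely while the device reports no connectivity (airplane mode). The `isOnline` and `trySend` hooks are hypothetical; on Android you would consult something like `ConnectivityManager` and your actual exporter.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Connectivity-gated retry with exponential backoff: a sketch, not a
// production exporter. The caller supplies the connectivity check and
// the send attempt as functions.
public class GatedRetry {
    public static boolean sendWithRetry(BooleanSupplier isOnline,
                                        BooleanSupplier trySend,
                                        int maxAttempts) {
        long backoffMs = 100;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Only burn a send attempt if the device thinks it has a network.
            if (isOnline.getAsBoolean() && trySend.getAsBoolean()) {
                return true; // delivered; the handoff is confirmed, not assumed
            }
            // Offline or failed: wait before the next attempt so we don't
            // drain the battery hammering a network that isn't there.
            try {
                TimeUnit.MILLISECONDS.sleep(backoffMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            backoffMs = Math.min(backoffMs * 2, 5_000); // cap the backoff
        }
        return false; // give up for now; the payload should stay persisted on disk
    }
}
```

The key design choice is that a `false` return is not data loss: the payload remains persisted locally, to be retried on the next launch or connectivity change.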

Hanson:

You mentioned where you do stuff. Metrics is a very important thing to have: aggregation of durations. With OpenTelemetry, typically, the reporting happens on the client side and the aggregation happens on the server side. So you report metrics, you can do some summation of metrics, and then have that reported, along with some metadata, to the server, and it'll do all the heavy work at the collector level. That's useful because you're expecting high throughput on the client, so you don't want it doing any heavy aggregation. So OpenTelemetry, for the most part, says aggregation is done on the server side, on the other end of the telemetry emitter.

Hanson:

Well, on the client, we're not doing a ton of repetitive things, and if there's aggregation to be done, it's usually across multiple instances of the app launching. Say I want to measure my, sorry, not my network, but my startup performance. That happens once every time the app starts up. So if you're aggregating locally, it's not that much data and it doesn't take that long, and in fact it can be much easier to do this locally than on the server side. Unfortunately, OpenTelemetry metrics don't work super well if you have high-cardinality dimensions, so we have to work around that a little bit in terms of what we do.
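The light client-side aggregation being described can be sketched as follows: instead of emitting one metric data point per app launch, fold startup durations into a tiny running summary that is cheap to store and send. The class is hypothetical and stdlib-only; a real SDK would also attach dimensions and handle persistence.

```java
// Folds per-launch startup durations into a small summary (count, sum,
// max) so only the summary, not every sample, needs to be shipped.
public class StartupAggregator {
    private long count = 0;
    private long sumMs = 0;
    private long maxMs = 0;

    // Called once per app launch with the measured startup duration.
    public void record(long startupMs) {
        count++;
        sumMs += startupMs;
        if (startupMs > maxMs) maxMs = startupMs;
    }

    public long launches() { return count; }

    public double meanMs() {
        return count == 0 ? 0.0 : sumMs / (double) count;
    }

    // The worst launch seen; on low-end devices this can dwarf the mean.
    public long maxMsSeen() { return maxMs; }
}
```

For example, launches of 800 ms, 1200 ms, and 8500 ms (the fast-phone versus old-Moto-X spread mentioned earlier) summarize to three launches with a mean of 3500 ms and a max of 8500 ms.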

Hanson:

So changing from a backend execution environment to a frontend execution environment requires some rethinking of these basics. Simply exporting the data as soon as the telemetry is recorded, assuming it'll almost always get to the other side: you can't ever assume that. In fact, you could start that request and have it fail in the middle. Even how we update and swap data on the client side, we have to be very careful. We don't want to blow up existing payloads that are perfectly good just because we have a better one. So managing each of these key steps is super important for mobile developers to be aware of when using OpenTelemetry, which is why you'd tend to want an SDK that has this built in, so you don't have to do things like that yourself, or do them in an ad hoc way.

Hanson:

Have you thought about what happens when folks background and foreground very quickly and, you know, sessions get created? Or when things are terminated by the OS because you're in the background now, and Android, once your app is in the background, can kill it at any time without telling you? Are you saving data in a way that doesn't do so much I/O as to drain the battery, but also doesn't lose data because you're not caching things? It's a tricky trade-off to balance, certainly.
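That caching trade-off can be sketched as a buffer that keeps telemetry in memory on the hot path (cheap, no I/O per event) but flushes to durable storage whenever the app is backgrounded, since Android may kill the process at any point after that without warning. The `persist` hook is hypothetical; a real SDK would write to a file or database on a background thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// In-memory telemetry buffer that is made durable at the one lifecycle
// moment when data loss becomes likely: backgrounding. A sketch only.
public class TelemetryBuffer {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<List<String>> persist; // e.g. write-to-disk hook

    public TelemetryBuffer(Consumer<List<String>> persist) {
        this.persist = persist;
    }

    // Hot path: no I/O, just an in-memory append, to avoid battery drain.
    public void record(String event) {
        pending.add(event);
    }

    // Call from the app lifecycle's on-background callback. After this
    // returns, an OS kill no longer loses the buffered events.
    public void onAppBackgrounded() {
        if (pending.isEmpty()) return;
        persist.accept(new ArrayList<>(pending)); // hand off a durable copy
        pending.clear();
    }
}
```

Flushing only on backgrounding (rather than on every event) is one point on the trade-off curve; an SDK might also flush periodically or when the buffer grows past a size threshold.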

William:

Yeah. So, getting into OpenTelemetry proper at this point: you said a few things there that I thought were really interesting, and before we go deeper, you sort of touched on not reinventing the wheel, not repeating yourself. So I guess my question is: OpenTelemetry is a really awesome project. I follow the chats and Slack and things at the CNCF. It's just a really bubbling community, a lot of good work going on. But is this how you think about OpenTelemetry as a product? Is it like a baseline where everybody's starting out at the same place, so we're not having to go back and redo the core things? We want to keep the core and foundational things steady, with a bigger community, owned by a foundation. We want those things to always be a level playing field, and then we're building our products on top of that foundation. Is that sort of how it's looked at?

Hanson:

Yeah, the last thing the software world needs is another standard, but we do have to have one, and OpenTelemetry is the one. It may not be perfect. There may be things that are not suitable for mobile, and so on. But, you know, you go to war with the army you have, or, to use a less aggressive metaphor, you code with the tools you have. You know what I mean. OpenTelemetry is very good. It is definitely good enough. It also has a very passionate community looking to make things better. So when I step in with these mobile problems, it's not, oh, this is not how we do things in OpenTelemetry. It's, please tell me more; what can the protocol do to help you achieve what you need to achieve? So building on top of that is something we want to do, and not only that, but to help better standardize the telemetry that's on mobile, have it be the lingua franca of observability, and be able to connect mobile data with backend data. I mean, SREs have tons and tons of data, maybe too much. Being able to join mobile telemetry, like spans and logs, with the context of the device that triggered a particular API call, the one that kicked off all your distributed tracing, that's pretty useful, especially if we provide context that is not easily derived from the backend metrics or backend metadata. Everything about the client, everything about the payload that gets sent. Certainly, on the backend, you're not going to crack open the payload, ask what it includes, and then add all that context. On the client, well, it's all there. It's not that all we have to do is annotate it, but we can annotate it, so you know that whatever generated a particular trace that seems anomalous has these characteristics: it's generated from this iOS version, it's generated with this payload that contains data from certain things.

Hanson:

One interesting story from back in the Twitter days that illustrates this: we found an issue with a certain crash on a certain device type, and we were like, what is happening here?

Hanson:

This device is no different from anything else. And we were able to use the telemetry to say, oh, this is all happening in Japan, this is all happening in a particular time window. So we knew the time window part; the question was how the clients could tell us more. We found out it was Japan, we found out it was a particular device model, and we found out it was because that device model had a pre-existing version of the app installed that had these weird characteristics. So we were able to trace it back to the origin of the bug, simply because we had all this metadata that you certainly couldn't have found with just backend data, even though the backend did tell us, oh yeah, our SLO for this particular thing has been violated. We needed the client data to actually find the context, and this is what OpenTelemetry gives us: additional context for backend issues.

William:

That's awesome, and you bring up a good point: differentiating between backend and mobile observability, and where they solve different problems. They're definitely different. And something that's interesting: I hear a ton about backend observability all the time, everywhere, LinkedIn, just everywhere. But I rarely, and maybe this is going to change, maybe it's on the upswing, but I rarely hear about mobile observability. Is that just because it's emerging and transforming now, or does it just not get the love it needs? What do you think?

Hanson:

I think mobile developers, up until now, have had enough issues to handle without looking for new ones. I was talking about crashes before, and ANRs; those keep mobile developers busy, on top of adding features that are requested, supporting new OS versions, new SDK versions, things like that. So it hasn't been until now that there are companies looking to do better, and most of the time they're the bigger ones. If you look at Twitter, Facebook, Netflix, they're going to have proprietary systems to measure workflows and how long things take, but you need teams of massive size to have people who specialize in mobile performance. I think as platforms settle and things get better, people are going to understand that performance is important and want to measure it. And it's even more important when your backend SRE says, oh, you're violating SLOs, but it's not really correlated with customer conversion. Why? Well, it's because you're missing a whole chunk of your workflow, the part that happens before your server, your backend SREs, can even detect that there's a problem.

Hanson:

Your client, your mobile app, has to make that request, and that's not a given. So not having data about what's happening on the clients and on the apps means you're basically erasing a whole set of problems. Well, what if my network request never made it off the device because there's congestion, a thundering herd at startup? I'm making 20 network requests at the same time, and you never get to the important one because you haven't prioritized it, or you could have deferred a bunch of this stuff. You don't know until you know what's happening on the client.
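The startup thundering-herd fix Hanson alludes to, prioritize the critical requests and defer the rest instead of firing everything at once, can be sketched with a priority queue. The class and request names below are illustrative, not a real mobile networking API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: instead of firing 20 requests at app startup, queue them by
// priority and dispatch only a small budget of critical ones first,
// deferring the rest so the important request isn't starved.
public class StartupRequestQueue {
    record Request(String name, int priority) {} // lower number = more urgent

    private final PriorityQueue<Request> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.priority, b.priority));

    void enqueue(Request r) {
        queue.add(r);
    }

    // Dispatch only the top-priority requests now; the rest stay queued.
    List<String> dispatch(int budget) {
        List<String> sent = new ArrayList<>();
        while (!queue.isEmpty() && sent.size() < budget) {
            sent.add(queue.poll().name);
        }
        return sent;
    }

    public static void main(String[] args) {
        StartupRequestQueue q = new StartupRequestQueue();
        q.enqueue(new Request("analytics-batch", 5));
        q.enqueue(new Request("auth-token", 0));  // critical: nothing works without it
        q.enqueue(new Request("home-feed", 1));   // first screen the user sees
        q.enqueue(new Request("ad-prefetch", 4));
        // With a startup budget of 2 concurrent requests, only the critical
        // ones go out immediately; the rest wait instead of congesting the link.
        System.out.println(q.dispatch(2)); // [auth-token, home-feed]
    }
}
```

Client-side telemetry is what tells you this reordering is needed in the first place: without spans around each request, the back end only sees the requests that made it off the device.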

Hanson:

And as platforms and things like OpenTelemetry get more standardized, it becomes easier to do. Because, as I said, on mobile there's no way to do this without custom SDK code and custom backend code to process, visualize and give you meaningful information from the data.

Hanson:

Until that process is easier, until it's easier for me to just go sign up for a service and have this data appear in a dashboard, have version-to-version diffs, and have alerts built when things go badly, you're not going to have anybody talking about it, because no one's using it, because the barriers to entry are too high. But OpenTelemetry, and folks like us and the OpenTelemetry community, are beginning to break this down. So I'm hoping that if I talk to you in a year or two years, measuring mobile performance is going to be a lot more common, because right now, other than app startup, you're not going to have a lot of people talk about how long things took to do on a mobile app.

William:

Yeah, yeah. Those are all great points. And, oh yeah, something popped into my head earlier that I just have to ask, I'm just curious. Say you're building new software for a startup, you're building something fresh, and of course you have all these great ideas and you can't do it all at once. Usually you have an MVP and you're starting off slow. You have something that you're taking to market and you're slowly iterating, slowly adding on. You're modular; maybe you have multiple microservices or whatever that are being worked on in tandem.

William:

But observability is just a really important thing, because you can't understand what you don't measure, and you want to understand what's wrong with the performance and all these different things. But as far as the development lifecycle process, when do you really start becoming concerned with and embracing observability? Is that really early on, like when you're writing your initial tests and stuff? I know there are best practices; we all wish we could do things one way and that is the perfect way, but we're talking reality here as well. Part of me thinks, okay, we look at visibility-type stuff only when we start running into problems, and that's a horrible way to do things, but a lot of times it's simply reality. What are your thoughts there?

Hanson:

Yeah, so observability is not just monitoring. Observability is a practice. You have to build your software with that in mind. Now, the instrumentation is sometimes tricky to do properly, especially when you have features to write; it's not always something you can just drop in and set and forget. But with OpenTelemetry, with a lot of back-end services, you can set it and forget it.

Hanson:

You include the package, you turn on the tracing with the configuration, by environment variables or some YAML file or something like that, and your commonly used library will emit telemetry. Now you have to set things up on the other side to receive the telemetry and process it, and you also have to have a dashboard, and maybe a vendor, to process that and send you alerts. The good thing is, on the back end there's a very well-developed ecosystem for that. You drop in OpenTelemetry, you sign up for Grafana, and you get your metrics, you get your traces and all that stuff. Fantastic. We need something analogous to that in mobile. So, to plug Embrace, we're effectively a solution like that: drop in our SDK and we'll record all the relevant data for you as telemetry.
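The "turn it on with a YAML file" flow Hanson describes usually means pointing telemetry at an OpenTelemetry Collector. A minimal collector config along these lines receives OTLP data and prints it; the `debug` exporter is a placeholder you would swap for your vendor's exporter, and the endpoints shown are the defaults, so treat this as a sketch rather than a production setup.

```yaml
# Minimal OpenTelemetry Collector config: accept OTLP over gRPC/HTTP
# and dump what arrives to stdout for inspection.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:   # swap for your vendor's exporter (or an OTLP exporter) in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

On the app side, the "environment variables" route typically means variables like `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` picked up by the SDK, which is what makes the back-end path close to set-and-forget.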

Hanson:

And you can use our SDK to create custom traces for your workflows, and you can come to our dashboard to see the data. Data gets exported as OpenTelemetry directly on the client, to your server, to your collectors, so you can actually parse the data yourself. You can also use the OpenTelemetry Android SDK; it does something similar. But nothing is there on the other end to capture the data, process it and visualize it; you have to set up Grafana or something like that. So it's not a one-stop shop, but it certainly works. And in fact, with Embrace, you can use our SDK without being an Embrace customer. It's open source. You just go online, I'm sure we can make the link available in the show notes, drop the Embrace SDK into your app, and start seeing the OpenTelemetry data getting forwarded to your side.

Hanson:

So, basically, making it so that all you have to do is think about needing observability, needing telemetry, and just hook up an SDK or two in order to get this data, that will make things a lot easier. Just like people do with crash monitoring: drop in Crashlytics, drop in whatever it is, and you'll get your data. And people will do that if it's as easy as you say it is, and if it doesn't impact your app negatively. Sometimes an SDK does too much to gather telemetry and actually reduces the performance of your app, so make sure that whatever SDK you drop in doesn't do that. Then, at worst, you have data that you're not looking at. But at best, you have data that you can look at to diagnose problems when you see them, or to catch problems before your customers or your users notice.

William:

That's such a good point, to make it easier on the developers. It kind of reminds me of security, which is such a hot-button item everywhere right now. Security is broad, so it's at every layer of every application, every service, every piece of infrastructure. Security is everywhere.

William:

But some of the time, doing security the right way is so hard, it's painstakingly difficult, and that's one of the things I'm seeing in the security space right now. Security engineering is trying to make it easier for folks to adopt these things. Exactly what you're saying, but applicable to security infrastructure and those different areas. And it's hard, because if something you're trying to do is massively painful and you practically need to hire a whole team to manage it, that sort of throws a wrench in your bicycle tires, you know, and you go flying. So that's a great call-out and a great point there.

Hanson:

You want security by default. You don't want to have to think about security after you release your app; oh, how do I make things secure? You want that built right in from the beginning. Similarly, you want observability by default. You don't want to have to think about what telemetry you need afterwards. You want to drop it in and have whatever it is do the important things for you, so that you can add on top of that, but the basics you get for free, or for little to no effort on your part. And that's very important for a developer when deadlines are coming fast and furiously.

William:

Yeah, that's such a good point, because right now we live in a day where there's a lot of software as a service, and if you don't have security built in, or you're really slim on the security side, people notice pretty quick. Whereas back in the day, when you were building stuff in data centers, things were a little bit different and you could kind of skimp on some things. But now everything's front and center, and there's competition that can eat your lunch that does have a lot of those things built in. So, really good points. This is awesome. I love what you said earlier, that if you come back and we talk in a year, we can evaluate where things are and what has changed. I would love to take you up on that, if you'd be willing, maybe revisiting this, say, this time next year.

Hanson:

For sure, I mean, if Embrace does its job. OpenTelemetry as a standard is relatively new, and even in the time that I've been involved with it, a year or so, it's grown leaps and bounds. So hopefully in a year, people won't think about OpenTelemetry without thinking mobile, or think about mobile telemetry without OpenTelemetry.

William:

Yeah, I guess it is that recent. It's pretty awesome seeing companies like Embrace, and other companies that are using it, adopt something that is so new. It really shows the value of community and foundations, and the work that the CNCF is doing, and folks coming together. Because it's a downstream and an upstream effect, really. If you don't have some of these foundations really hammered out, of course it impacts the companies that are building stuff, but it also impacts the users that are consuming these mobile apps, and they're going to have more problems. So this is a really good example of the community, the vendor space, and some different areas getting their act together, coming together and doing something good for the greater good, I guess.

Hanson:

I think the entire industry is tired of de facto standards. You know, Internet Explorer, the de facto standard: oh, 95% of people are using it, so it's the de facto standard. No, no. We want actual standards that are vendor neutral, where you're not locked in.

Hanson:

OpenTelemetry is about vendor neutrality, and it's funny, it's coming from a vendor saying that. But we totally believe in your data being your data, and we record things in a vendor-neutral way, so that you could take it and not have to rebuild your infrastructure. We want you to use us because we provide service on top of that that is good. But the SDK itself, you should be able to use without having to worry about being locked into Embrace. Because without a healthy ecosystem where we push each other, you're going to atrophy. You're going to have these vendors, or a particular vendor, being very large and exerting their market power on pricing or whatever. No, no. We want this vendor neutral, portable, extensible, and OpenTelemetry is a cornerstone, a key part of the strategy. So we want to be part of the community. We don't want to own mobile telemetry. It's not about that.

William:

I love that. Yeah, that gives me hope for technology in the future. You know that kind of attitude and that kind of approach. It's great, absolutely love it. So where can the audience? Are you active on any of the social platforms? Where can the audience find you?

Hanson:

My blog is hansenwtf, and you can link to all the socials there. I still use Twitter for sports things, basically, but I'm mostly on Threads, where I'm just putting whatever ridiculous things I want to post out there. And check out the Embrace GitHub; we have an iOS SDK, an Android SDK, and a React Native SDK to do this stuff. But yeah, I'm on socials. Hanson Ho is a very Googleable name. I'm not the architect from Singapore; I'm the one from Vancouver, Canada. If you Google Hanson Ho, the one in Vancouver, it's pretty idiosyncratic.

William:

Right on. Yeah, I'll definitely link these in the show notes so folks can find you easier and follow if they want to. And I just want to say I really appreciate the time. This has been a fascinating conversation; it's such an interesting and emerging area, and I do look forward to it. I'll hold you to it: let's have this conversation in a year and see where we're at.

People on this episode