Ingestion c3878fb3 extracted

Format
transcript
Kind
talk
External ID
3. John Gallagher - Fix Production Bugs 20x Faster - wroc_love.rb 2025.txt
Content hash
e027296d947b
Source at
2025-03-14 09:00

Content

Today our first speaker is John.


John currently works at Bigger Pockets,


but this journey is soon coming to an


end and he'll be working at Dynatrace,


and he'll be developing the best Ruby


agent for you so you can use it


freely and improve your observability


because John is one of the observability


experts in our community and he'll tell


us about how to fix production bugs 20


(but I also heard 25) times faster. So,


for John, please give a warm welcome, and


the stage is


[Applause]


yours. Thank


you. So, I'm going to talk about how to


fix your production bugs 20 times


faster. So, I work for a real estate


company called Bigger Pockets based in


the US. The following is based on a true


story. Names have been changed to


protect the innocent or in some cases


guilty. So, stop me if this sounds


familiar. It's Friday afternoon and I'm


on


call and I get a Slack


notification. Hm, that's a bit strange.


It's from support. What's going on here?


Um, okay. And another


one. Ah, uh, there's a bit of a problem.


We've got an emergency on our hands. So,


we'll search the


logs and we'll um Oh, that's a bit


weird. Hm, that's not very good. Uh, but


we'll check the code. It'll be obvious


what's going wrong. Um, emails are not


getting through. Password reset emails


not getting through. So, let's check.


Um, and that looks kind of okay, which


doesn't really help me.


Meanwhile, things are


escalating. Customers are complaining.


They're taking to Twitter and they can't


get their password reset


emails. They're actually locked out of


their accounts. So, this is a bit of an


emergency.


So, I'm doing something else, but I'll


drop everything I'm doing because we all


like context switching, right? It's


brilliant. So, I'll pair with a


colleague and we'll add some error


logging to the background


job that sends these password reset


emails. And while we're


waiting for it to deploy,


what? It's fixed, it's okay. I mean, that


seems a bit weird, but we'll fix it


later. We'll add it to the


backlog. Um, actually, that's not the


backlog. This is the backlog. It's a bit


in fact. No, that's not the backlog.


That's the backlog. Um, so as I was


saying, we'll add it to the


backlog just there. And that will get


done. I don't know when that will get


done. Um, probably really soon. And


after all, it's only a one-off issue,


right? It's not going to happen again.


And


um, yeah. Okay. Uh, it's happening


again. So, we've got the same problem.


And the support team are completely


overwhelmed at this point. But it's no


worries at all because we added that


uh background job logging. So,


we'll just search the logs and


um


no, not again. Okay, it's fine. We'll uh


we'll go to the Sidekiq background job


queue processing admin dashboard and that um


what does that tell me? It's got some


processed


Not really sure what that really tells


me. Not much. And so at Bigger Pockets,


we were in this vicious cycle. We'd add


a little bit of extra logging, kind of


guess what the problem was, and then it


would kind of go away and then it would


come back and customers were getting


annoyed and it was just really a bad


experience for everybody and we were


banging our head against a brick wall.


So when I come across a problem like


this, I think, well, maybe it's just us.


Let's go a bit wider. So I asked 50


engineers this


question. Basically, how often do you


encounter issues that you don't have


visibility on? And the results kind of


surprised me. It turns out


about half the engineers I surveyed say once


a month and another third say once a


week. Okay. So not just us then. And


next I want to talk about feelings. As


engineers, we maybe don't want to talk


about feelings too much. We like data


and we like hard things, but feelings


are also obviously very important. So


the first feeling I have is just


annoyed. Like seriously, in 2025, this


is what we've got to go through to just


understand what our app is


doing. And then I just start to get a


bit bored. Like I'm going through the


logs. I'm looking at stack traces. I keep


going through the same three tabs in


Google Chrome over and over again


expecting some


revelation and then the pressure kicks


in. Maybe it's been an hour, two hours


since this production outage and I still


can't fix


it. And then I start to think, hang on,


maybe the problem's with me. Maybe I'm


the problem here. Maybe I'm just not a


very good engineer. But hang on. I've


been working with Rails for 15


years now, and I've got to go to my boss


and say, "Oh, I know you're paying me


all this money, but sorry, I can't


figure it out." It's not a great look,


is it


really? So, if this sounds familiar,


I've got some good news. It's not you,


and you're not


alone. And while I've been at Bigger


Pockets, I've developed a five-step


process to help with this problem. First


of all, we talk about what question we


want answered. Then we decide on the


data we want to gather. We build the


instrumentation. We use the graphs and


then we improve. And if you turn it on


its side, it looks like some steps. So I


call it the steps to observable software


or


SOS. So let's walk through these steps


with the example that I've just given.


First of all, we want to talk about the


question we want answered. Now the


obvious question is why are these


password reset emails not being sent? So


that's pretty vague. We can't create any


graph that's going to answer that


directly. So instead of that we need to


come up with some


hypotheses. Why might it be going wrong?


It could be failing. The jobs could be


timing out. They might be


delayed. So we know it's not failing


because we checked in Sentry. nothing


there. They're not timing out. So maybe


they're delayed and they may or may not


be. This is just a guess. Let's try and


prove the guess correct or incorrect. So


just to recap, here's the theory. Here's


what a healthy queue might look like.


The green jobs are our password reset


email


jobs. And here's what we think is


happening in the queue. We think these


green jobs are being pushed down and


being delayed by a whole slew of other


jobs that are coming in before them,


right? So let's try and figure out if


that is what's going on or not. So in


order to do that, we need to find a more


specific question. The specific question


is what jobs are taking time in the


within five minutes queue? That is


something we can answer.


So let's decide what data we want to


gather to answer that


question. And when I think about this, I


think in terms of roughly four


dimensions. What event do we want to


gather? What do we want to filter that


event by? What do we want to group it


by? And what do we want to plot the


values of? So the event is when any job


has been performed.


We want to filter by the job queue. So


we want to see all the jobs performed


within a certain queue. We want to group


it by job class and we want to see the


aggregates of the duration. So that


means we can see the duration per job


class in that


queue. And if all of this seems a bit


kind of


overwhelming, I like to convert it to


SQL. It's really just a SQL statement


that you're writing on events.
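
The four dimensions above (event, filter, group, value) can be sketched in plain Ruby over an in-memory list of events. This is only an illustration under stated assumptions: the `Event` struct, the `job.performed` event name, the `within_five_minutes` queue name, and the job class names are all hypothetical and not taken from the talk's slides.

```ruby
# Hypothetical "job performed" events. Field names, queue name, and job
# classes are illustrative assumptions, not taken from the talk's slides.
Event = Struct.new(:name, :queue, :job_class, :duration_ms, keyword_init: true)

EVENTS = [
  Event.new(name: "job.performed", queue: "within_five_minutes", job_class: "PasswordResetEmailJob", duration_ms: 120),
  Event.new(name: "job.performed", queue: "within_five_minutes", job_class: "AnalyticsJob", duration_ms: 2_300),
  Event.new(name: "job.performed", queue: "within_five_minutes", job_class: "AnalyticsJob", duration_ms: 1_900),
  Event.new(name: "job.performed", queue: "default", job_class: "CleanupJob", duration_ms: 50),
  Event.new(name: "job.enqueued", queue: "within_five_minutes", job_class: "AnalyticsJob", duration_ms: 0)
].freeze

# Roughly equivalent to:
#   SELECT job_class, SUM(duration_ms) FROM events
#   WHERE name = 'job.performed' AND queue = ? GROUP BY job_class
def total_duration_by_job_class(events, queue:)
  events
    .select { |e| e.name == "job.performed" && e.queue == queue } # event + filter
    .group_by(&:job_class)                                          # group by
    .transform_values { |group| group.sum(&:duration_ms) }          # value to plot
end
```

Calling `total_duration_by_job_class(EVENTS, queue: "within_five_minutes")` returns one total per job class, which is the shape of data the later graphs plot.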


Okay, so much for that. Now we come to


the actual building of the


instrumentation. So I'm going to cover


these four


things. Um, and the first thing I'm


going to cover


is why logs. Like, what should we use?


Should we use traces, logs, metrics?


There's a lot of confusion out there


about this. If anybody says to you these


are the three pillars of observability,


run a mile because they're not pillars.


They're different data types and in my


take traces are the best and metrics are


essentially the worst. There's use cases


for all three of them. We're going to


start with logs because logs are most


familiar to developers and uh big


improvements are coming in Rails 8.1 for


structured


logging. Okay. So the next question is


do we use plain logs or structured logs?


Well, the title of the talk is power of


structured logging. So, I'm sure you all


can guess what we're going to use here,


but I want to explore why we use that


and what are structured logs


anyway. Here's what a plain text log


looks like. Hopefully, this looks fairly


familiar. It's a string that goes to the


logs. No big deal. But you can


see the um the data here is just smushed


into the string. Okay.


Here's what it would look like in your


observability tool of choice. It would


be a single column just saying performed


Action Mailer mailer job in whatever.


So, how do we find all the Action Mailer


jobs? Well, we have to use a regular


expression because it's just a plain


string. Okay, so far so good. What about


if I want to see anything that's over a


second in duration?


How are we going to do that with a


regular expression? Oh, that's right. We


can't. So, this is not very useful. And


I still see people running Rails apps


with plain old logs without any


structure to them. It blows my mind. How


you do it, I don't know because I've not


figured out how to do that. So, this


isn't very good. This isn't great.
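
To make the contrast concrete, here is a small stdlib-only sketch. The log line format and the attribute names (`job_class`, `duration_ms`) are assumptions chosen for illustration. With plain text, a numeric comparison cannot be expressed by the regular expression itself: the number has to be captured and converted in code, for every line format you have. With one JSON object per line, the duration is already a number attached to a named attribute.

```ruby
require "json"

# Assumed plain-text format, for illustration only.
PLAIN_LINES = [
  "Performed PasswordResetEmailJob in 120.5ms",
  "Performed AnalyticsJob in 2300.0ms"
].freeze

# Plain text: the regex can only capture the digits; the "greater than"
# comparison has to happen outside the regex after converting to a float.
def slow_plain(lines, threshold_ms)
  lines.select do |line|
    match = line.match(/in (\d+(?:\.\d+)?)ms\z/)
    match && match[1].to_f > threshold_ms
  end
end

# The same information as structured logs: one JSON object per line.
STRUCTURED_LINES = [
  { event: "job.performed", job_class: "PasswordResetEmailJob", duration_ms: 120.5 }.to_json,
  { event: "job.performed", job_class: "AnalyticsJob", duration_ms: 2300.0 }.to_json
].freeze

# Structured: parse once, then filter on a typed attribute by name.
def slow_structured(json_lines, threshold_ms)
  json_lines
    .map { |line| JSON.parse(line) }
    .select { |entry| entry["duration_ms"] > threshold_ms }
end
```

In an observability tool the second form is what allows a filter such as "duration over one second" without any string parsing.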


Here's a structured log. Now, the first


thing you'll notice is it looks really


weird. I don't see many logs in my


codebase looking like this. Um but this


is the way


forwards. Why? Because you can show


those attributes as a column. So the


attributes are just in


essentially a


hash. But what you're doing here is


building up attributes of data a little


bit like a database. So you can show


those attributes as columns and then you


can filter by any attribute. You can


group by it. You can sort by it. And the


tool understands this is a specific data


type and it's a specific attribute. So


this is nice. We like this. Next of all,


I want to talk about what logging


library to use. There are many, many


improvements coming in Rails 8.1, but


they're not here yet. And obviously,


many of you in this room may be using


Rails prior to


8. So here are my criteria. They're


pretty


subjective. Some of them are not. Um, so


can it log structured payload? Does it


integrate with Rails? Is there some


great documentation? That bit is quite


subjective. And is it


mature? So before I go through all the


options, I just want to say thank you to


everybody who works on open source


software in any way, shape, or form. Our


community would not exist without you


all, obviously. So massive thank you.


Having said all of that, here's how it


shakes out.


So semantic logger is the best one that


I've come across. I've also excluded


anything from this slide that is maybe


less than a year old. So there are some


logging tools that are coming up that


are quite promising, but they're just


not quite mature enough yet for


production use. We've been using


semantic logger for three years now at


Bigger Pockets in production and it's


solid as a rock.


So, um, let's actually use semantic


logger. And so, you install Rails


semantic logger. Install semantic


logger. Uh, it's got Rails bindings, of


course. You set the format as JSON. And


it gives you some automatic logging out


of the box. So, here are all the


libraries that it works with. I'm just


going to focus on active job because


we're doing background job processing.


So here is what the output might look


like from semantic logger. You'll see


it's nicely structured. We have our


event name


here. We've got the job class and queue


here, which are the other attributes we


care about. And finally, we have the


duration. Okay, so we're all good. We


just install that. And like all these


observability vendors say, including


Dynatrace, you click a box, you install


our library, then everything just works.


That's the dream, right? Well, not


quite. There are a few things that this


does not do and many more besides, but


these are the main three. So there


are three weaknesses. The first thing is


no conventions. We're all sat here


because Rails compelled us with the


convention over configuration idea,


right?


However, that principle hasn't been


brought through to logs


yet. And so, all of these things are


equivalent for attribute names. Choose


whatever you like. If I'm working on a


team with somebody else, um, it's fine.


We'll just work in our little silos and


you create your attribute and I'll


create mine and then they won't work


with each other. No, that doesn't sound


like such a good plan, does it? So,


enter


OpenTelemetry. This is um a set of


standards. It's one of the fastest


growing projects and it's now overtaken


Kubernetes


um one of the fastest-growing CNCF projects, I


should


say. So the OTel library for Ruby isn't


that mature yet. I wouldn't suggest


using it directly in production. Some


people are um but what we can use from


OpenTelemetry is the semantic


conventions which don't require you to


install anything. They're a set of


naming conventions for attributes.


And so here you can see this is the


official name for job queue name. It's a bit


weird. It's like messaging.destination.name,


um for reasons that uh will remain


nameless. Uh, the OpenTelemetry


community has decided background jobs


are a messaging protocol. So it's the


same thing as Kafka, Redis, um, not Redis,


sorry, Kafka, RabbitMQ. So anyway, that's


a name and so out of the box, this is


what semantic logger will give you. And


when you format it for


OpenTelemetry, this is what you get. So the


names are all standard and you can now


start to switch between different


observability tools and we've got a


convention that everybody can agree on.
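
A minimal sketch of that renaming step, assuming a payload hash with a `queue` key. Only `messaging.destination.name` comes from the talk; the input key names and the pass-through behaviour for unmapped keys are assumptions made for this example, and this is not the speaker's actual formatter code.

```ruby
# Maps local attribute names to OpenTelemetry semantic-convention names.
# Only the queue mapping is taken from the talk; extend the map as your
# team agrees on further names.
OTEL_KEY_MAP = {
  queue: "messaging.destination.name"
}.freeze

# Returns a new hash with known keys renamed and unknown keys passed
# through unchanged (as strings), so nothing is silently dropped.
def to_otel_names(payload)
  payload.each_with_object({}) do |(key, value), renamed|
    renamed[OTEL_KEY_MAP.fetch(key, key.to_s)] = value
  end
end
```

Keeping the mapping in one table means a team has a single place to agree on names, rather than each developer inventing an attribute per log call.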


So let's change the code. And to do


this, it's a little bit hacky, so just


bear with me. We find the code inside


semantic logger that does this work. We


copy that class into our


codebase. We need to change the payload


event of the event


formatter. And then finally, we swap out


our subscriber for the semantic logger


subscriber. I can go through this in


detail. Come and come and speak to me


afterwards, but the point is it's a bit


messy, but it can be done.


And so with that, semantic logger


now spits out something like this, which


is great. The next problem is we've got


missing attributes. So there's a


whole load of things that semantic


logger does not give you out of the box


that are kind of essential. So these


kind of things it doesn't give you out


of the box. So headers, it doesn't


automatically log any HTTP headers. Now,


I don't really understand how anyone can


run an app in production and not have


these things logged. It's kind of


critical, but people do it. So, I just


want to show you a very small way of


logging some of these things. And we're


just going to focus on two, the headers


and the user


agent. And we do this thing called


config.log_tags. In our application.rb,


we can define a hash. And these are tags


that get included in every request and


they get sent to semantic logger and


therefore they get sent to your


observability tool. As you can see here,


we've got two different ways of doing


this. We've got a lambda syntax. It


passes in the request and then you can


call methods on the request. Then


secondly, you've got a more compact


format and this essentially calls


user_agent as a method on request. So it's


less flexible, but it's more compact.


This is just for a straight


mapping. And when you do this, here is


the output in the logs. You can see


we've got some HTTP request headers and


we've got a user agent. So that's nice.
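
The two tag styles described above can be simulated without Rails. In this sketch `LOG_TAGS` has the same shape as the hash the talk assigns to `config.log_tags`, while `FakeRequest` and `resolve_tags` are hypothetical stand-ins written for this example to show how a symbol entry and a lambda entry would each be resolved against a request. The `Accept-Language` header is an arbitrary choice.

```ruby
# Stand-in for the request object; not the real ActionDispatch::Request.
FakeRequest = Struct.new(:user_agent, :headers, keyword_init: true)

# Same shape as a hash assigned to config.log_tags.
LOG_TAGS = {
  # Compact form: the symbol names a method to call on the request.
  user_agent: :user_agent,
  # Lambda form: receives the request, so any expression is possible.
  accept_language: ->(request) { request.headers["Accept-Language"] }
}.freeze

# Hypothetical resolver showing what each style evaluates to.
def resolve_tags(tags, request)
  tags.transform_values do |source|
    source.respond_to?(:call) ? source.call(request) : request.public_send(source)
  end
end
```

The compact form suits a straight one-to-one mapping; the lambda form is the one to reach for when the value needs a header lookup or any transformation.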


The final thing I'm going to talk about


is API requests are missing. So to


simplify this, I'm going to assume


you're all using Faraday, which as we


all know is the best HTTP client, right?


Yeah. Okay, good. Uh glad I've got you


on board with that one. And we're going


to define some


middleware. That middleware essentially


logs when a response comes back


from any API request you make. It will


log all these kinds of attributes: the


URL and the duration, all sorts of other


useful things. And then we can register


that and use it in any API requests we


make. And this means we get to see all


the API requests that are coming into our app


and all the ones that are leaving our


app, which is incredibly


useful. And here's what it looks like in


the


logs. So, we've covered three of the


biggest weaknesses. There are a lot more


that I won't go into


today. And the final step of building is


actually to send it. So, we've logged


all our stuff internally. Now, it's time


to send it to our observability tool.


Semantic logger here has you covered.


This is one of my favorite features of


semantic logger. This idea of appenders.


So, you can set up an appender here. I'm


going to go off and work for Dynatrace


as mentioned before. So we will create a


little HTTP appender and this will batch


all our logs make an API request to


Dynatrace that will ingest the logs and


it's all good to


go. And this is what it might look like


in Dynatrace. So you can see the logs on


the left there and then on the right


here you can see like the structured


version of each one of those individual


logs and then you can sort and search


and filter by it in the top


bar. Okay, so that was a lot um bit of a


whirlwind tour of how to build this. The QR


code will give you a link to my Linktree


and you can get the slides. Just come up


and speak to me afterwards. I'll show


the QR code at the end as well, but


come and speak to me afterwards if you


want any more detail on any of that. So,


the final thing we need to do is the


actual exciting bit. We've done all the


boring work of coding and all the rest of


it, which we all agree is incredibly


dull. Maybe not. Um, but we get to


actually use the graphs now. So, here's


a reminder of all the


attributes. And we're going to set the


time


range.


Okay. And now we're going to see all the


jobs performed. So you can see up here


we're searching for everything within


that


queue.


Okay. Next, we're going to group the


duration by job


class.


Okay. And here's what we


get. This is nice. We can actually see


which jobs take most


time. Final step is to improve. So


there's a whole bunch of things that


we've already done that are not ideal,


but generally speaking, I find it's


helpful to criticize the work I've


already done. I'm a bit of a


perfectionist, so I get a whole load of


improvements from that. I show


colleagues what I've done. Um, and I get


to learn what they want. So very often


I've had the experience of, oh, I've


made this cool graph in Datadog. And


people's eyes just kind of glaze over.


Um, and they say, "Well, I don't really


care about that." And I say, "Oh, okay.


What do you care about?" "Oh, well,


we've got this thing in production and


we can't figure out what it is." And so


then we can apply those same principles


to their problem, and this is how you


get buy-in from other people at your


company to improve this stuff. Show them


what's possible in a tiny little slice


and then give them the tools to make


what they care about


observable. And so we might want to add


other attributes at this point. We might


want to add job latency or IPs or


request IDs. We've gone around this


cycle a lot at Bigger Pockets. I've been


working on this for about two years now.


And I want to revisit our original real


example with what's our experience now


if this were to happen. So I'm still on


call, sadly. It's still Friday.


However, I now


get a different Slack message and it's a


lot earlier. It's on Friday morning


now. And this message is from our


monitoring tool. That's a bit strange.


So, I click the link and here's what I


see. It sends me to a graph. What is


that blue line doing? That's really


strange. Oh, wow. The within five


minutes job queue now has a latency of


15 minutes. So, it's breached its SLA.
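
The check behind an alert like this can be sketched as follows. It assumes queue latency means the time a job has waited since it was enqueued and that the queue's promise is five minutes, as its name suggests; the constant and method names are hypothetical, and a real monitoring tool would evaluate this on its side rather than in application code.

```ruby
# Assumed SLA for the "within five minutes" queue, in seconds.
SLA_SECONDS = 5 * 60

# Latency: how long a job has been waiting since it was enqueued.
# Subtracting two Time objects yields a Float number of seconds.
def queue_latency_seconds(enqueued_at, now)
  now - enqueued_at
end

# True when the oldest waiting job has exceeded the queue's promise.
def sla_breached?(enqueued_at, now, sla_seconds: SLA_SECONDS)
  queue_latency_seconds(enqueued_at, now) > sla_seconds
end
```

A job enqueued at 09:00 and still waiting at 09:15 has a latency of 900 seconds, three times the assumed limit, which is the situation the Slack alert reports.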


Hm. So, what's the question I want to


answer? Okay. Um, which jobs are taking


the most


time? So, we'll go to the logs. We'll


search for jobs performed within that


queue. We'll group by job class. We'll


plot the duration. And here it


is.


Okay. So, it's clearly the analytics


update user visits job that's taking all


the time. What the heck is that? That's


interesting. I've just learned something


about my software in seconds, not even


minutes. Okay, that's the answer. So,


what's enqueuing those jobs? I'm


curious. We search for the event jobs


enqueued within that queue. Let's group by


HTTP resource, which is a combination of


the controller and the


action. And here's what we


get. What's that? That green bar


looks


interesting. It's the profile show


controller. So it's the show action in


profiles controller. That's a bit strange.


I wasn't expecting that. Okay. Um what


IP address is hitting that action which


is then queuing the jobs? Hm. So, let's


show the number of HTTP


requests to that same resource and we'll


group it by IP


address. Oh, wow. That's a lot of IPs in


that light blue


region. Oh, that's the IP address. Okay,


I've got it. So, let's just recap.


There's a scraper at that IP address.


It's hitting profiles show. It's


flooding the queue with these jobs and


that delays our password reset emails.


This is actually what happened. Real


life


example. And the fix is pretty simple.


We now block that IP in our


infrastructure. And now we've blocked


it. Takes a few seconds. We can check,


is it fixed? We go back to the


graph. Oh, it's back to normal. And


we're done. Just another Tuesday at the


office.
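
The drill-down in this story is the same group-and-count question asked twice with a different grouping key each time. A hedged sketch over made-up data: the event hashes, the controller names, and the IP addresses (taken from the reserved documentation ranges) are illustrative assumptions, not the real incident data.

```ruby
# Hypothetical "job enqueued" events tagged with the HTTP resource that
# triggered them.
ENQUEUED = [
  { job_class: "AnalyticsJob", http_resource: "ProfilesController#show" },
  { job_class: "AnalyticsJob", http_resource: "ProfilesController#show" },
  { job_class: "AnalyticsJob", http_resource: "ProfilesController#show" },
  { job_class: "PasswordResetEmailJob", http_resource: "PasswordsController#create" }
].freeze

# Hypothetical HTTP request events tagged with the client IP.
REQUESTS = [
  { http_resource: "ProfilesController#show", ip: "203.0.113.7" },
  { http_resource: "ProfilesController#show", ip: "203.0.113.7" },
  { http_resource: "ProfilesController#show", ip: "203.0.113.7" },
  { http_resource: "ProfilesController#show", ip: "198.51.100.2" },
  { http_resource: "HomeController#index", ip: "198.51.100.2" }
].freeze

# Group records by one attribute and return [value, count] for the largest group.
def top_value(records, key)
  records.map { |record| record[key] }.tally.max_by { |_value, count| count }
end

# Step 1: which resource enqueues most jobs? Step 2: which IP hits it most?
def busiest_ip_for_busiest_resource(enqueued, requests)
  resource, = top_value(enqueued, :http_resource)
  top_value(requests.select { |r| r[:http_resource] == resource }, :ip)
end
```

Each step narrows the next one's filter, which is why having the HTTP resource and the IP recorded as attributes on the events matters so much.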


So that was maybe five minutes, seven


minutes in total. No real disruption to


production work, no tickets, no backlog,


just a few little graphs, a few little


queries, and we can actually understand


what's really going on. How does this


feel? Well, as you can imagine, it feels


joyful. Feels like I have a superpower.


Any bug that comes onto that backlog, I


can start to have confidence. I can


actually gather data to understand what


the heck is going on. And what I didn't


anticipate on this journey is it becomes


addictive. I start to go into logs. I


answer one question and I'm done with


the the question. I've got all the data


I want. I get my answer. I go back to


the codebase. I fix it. I push it. I


deploy it and it's all good. I can see


it go back down. But in the process, I


maybe see something else in another


graph unrelated to this and I start to


get curious. As an example of this,


recently I went into a dashboard that


we've made and I saw we've got actually


quite a big chunk of 404s. I


wonder why that is. Like two minutes


later, I had my answer. It turns out


there's yet another scraper going


through all our usernames, from A


to Zed,


essentially guessing at


usernames. And so it's scraping our


entire site occasionally hoping it'll


find somebody with a valid username. So


again, I blocked those. Our request time


dropped by I think it was


7.1%. Just from that one change. And


again, that would have just gone


unnoticed if we hadn't


had observability. So yeah, it gets


addictive and you start to ask questions


that you didn't even know you had


before.


So, what are the results of this? Over the


last 18 months, we've been able to


reduce our downtime by 98%. We now have


regular pizza parties where we don't


have any downtime at all. Um, our 500


errors have dropped by 83%. We've still


got a lot of errors, but they've


improved massively. And most of all, we


can now fix bugs 20 times faster.


So, if any of this sounds interesting to


you, feel free to scan the QR code.


There's a whole bunch of articles I've


written on there, videos I've done. Um,


and there's one or two little paid uh


things, but most of the stuff is free.


And if any of this stuff sounds


interesting, please grab me. I'm


actually going to be leaving today at


4:00 p.m., sadly, uh, to catch a flight


back. But yeah, thank you for your time.


[Applause]


Yeah, thank you for the talk. Uh quick


one about the bills for the logs. Uh and


a long one about your approaches or


rules of thumb when uh you log stuff on


the application side like um operations


that run or maybe changes that happen in


the database or something like this. I


didn't quite understand that last


question so you may have to repeat it.


Yeah. Yeah, you've covered the Rails


part with all the controllers, jobs and


stuff. Uh, but there is also business


logic, and sometimes you want to log


some actions on the business logic side


and you didn't cover it I guess because


everybody does this differently but I am


curious how do you do this? Fantastic


question. Both really good questions. I


like when you were handed the


microphone, I just had this sense he's


gonna talk about


cost. Um because cost is such a big


issue with zero interest rates


going away and we've heard this a lot,


haven't we? So in terms of cost, there's


no silver bullet with cost. Sadly, I


wish there was. Um there are a few


things to say. Number one is by using


OpenTelemetry and starting to embrace


the standards way of doing this you can


chop and change between vendors. So if


one vendor is getting expensive you can


move to another. That's the first thing


to say. The second thing to say is


realistically a lot of people are locked


into a vendor. Um and it's actually not


the OpenTelemetry thing that's hurting


them. It's the fact they're locked into


alerts, dashboards, um, and obviously


the UI for every one of these tools is


different. So there's a


big switching cost even if you are on


OpenTelemetry. So OpenTelemetry is


going to make it easier and more


cost-effective for people, but there's still


a switching cost. That's the first thing to


say. Second thing is I looked at our


logs and I did an audit. Generally


speaking, I've done this kind of journey


towards observability in two or three


companies before my current company and


the pattern is very similar.


They're paying a nominal amount of money


for some observability tool. Nobody's


using it. Um which absolutely boggles my


mind. Um at one company I worked at, I


said to them, "What's this Kibana


thing?" And they were like, "Oh yeah,


nobody uses that." I said, "How do you


know when things are going wrong?" They


were like, "Oh, we just I mean, we added


a bit of logging and then we" They were


literally like taking a card, there was


a problem. They would add three lines of


logging and then just move it back into


the backlog. And they would do that like


six times. And they're like, "Well,


let's see if it comes up again. Let's


see if it comes up again. No, let's not.


Let's actually fix it, shall we?" But


anyway, sorry, that's a bit of a


tangent, but the point is they start out


with a tool. They're paying a little bit


of money for it. Nobody's using it. And


then we start to introduce some


observability and it's all going well


and then people go crazy and start


adding logging for this and logging for


that and traces just in case. Just in


case. And then the costs go out of


control and then you get a call from the


finance department. Um yeah, we're


spending like $7,000 a month on this. Do


you think you could kind of rein it in a


bit? And one of the craziest things I


think we do as a


community is we don't connect engineers


to the costs. Engineers don't see the


costs, so you never get to see the


implications of your decisions at least


the financial implications anyway. So um


and then they say right can you kind of


calm it down a bit. So then there's


normally a phase of um I kind of look at


logs and see what's being logged and try


and reduce it down. So um I did a little


bit of work at Bigger Pockets to do this


and over 3 days I reduced the logging


bill by


41%. And it turns out we were logging


the text


"okay" about 231 million times a


month. Not great. So I think a little


bit of common


sense and this is why grouping and


structured logging is amazing because


you can group by the uh the content key


and some observability tools will show


you patterns in your messages and then


you just work down from the top like wow


you're logging this 231 million times


what's going on and then you can


eliminate that. There's a little feature


in semantic logger where you can pass a


proc which filters out certain logs. So


I use that. You can also exclude it from


the


collector. Um the final thing


to say on the cost thing is only add


what you actually need. And this is why


this cycle for me is so important


because you're adding something that is


an issue for you right now. You're not


guessing. I see these companies say,


"We're going to take three months to add


observability to our product. We're


going to add all these logs and then


nobody uses them." Not a good idea.


Start with a problem that you're


interested in now and add just enough to


answer that


problem. So that's the cost thing.
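
The audit described above (group by message, work down from the most frequent, then filter out the noise) can be sketched in plain Ruby. The `LogLine` struct and sample data are hypothetical. The `KEEP` lambda follows a keep-if-true convention for a filter proc; check your logging library's documentation for the exact contract before relying on it.

```ruby
# Hypothetical log lines; only the message matters for this audit.
LogLine = Struct.new(:message, :level, keyword_init: true)

LINES = [
  LogLine.new(message: "okay", level: :info),
  LogLine.new(message: "okay", level: :info),
  LogLine.new(message: "okay", level: :info),
  LogLine.new(message: "Performed AnalyticsJob", level: :info),
  LogLine.new(message: "Performed AnalyticsJob", level: :info),
  LogLine.new(message: "Password reset requested", level: :info)
].freeze

# Count identical messages and list the most frequent first.
def noisiest(lines, limit: 3)
  lines.map(&:message).tally.sort_by { |_message, count| -count }.first(limit)
end

# A filter in the style of the proc the talk mentions: true keeps the line.
KEEP = ->(log) { log.message != "okay" }

def filtered(lines, keep: KEEP)
  lines.select { |log| keep.call(log) }
end
```

Working from the top of the `noisiest` list downward is usually where the biggest savings are, as with the "okay" message in the talk.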


Any other questions from people


about cost before I move on to the


domain stuff?


Um so the domain stuff um I've developed


quite a lot of ideas I've not shared in


this talk uh about how to do this. In


short


um I find it really useful to log or


trace domain objects. So let's say


you're um invoicing a customer. You're


creating a new invoice. So I would uh


create a structured version of that


invoice object. I like to_formatted_h.


So every domain object in our codebase I


try to make it respond to to_formatted_h


which means it responds with a hash.
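
A minimal sketch of this convention, assuming the method is spelled `to_formatted_h` and using a hypothetical `Invoice` with made-up fields. The `event_json` helper stands in for whatever structured logger is in use and is not part of any library.

```ruby
require "json"

# Hypothetical domain object; the fields are illustrative.
Invoice = Struct.new(:id, :customer_id, :address, keyword_init: true) do
  # Returns the loggable view of the object as a plain hash, with
  # prefixed keys so attributes from different objects do not collide.
  def to_formatted_h
    { invoice_id: id, customer_id: customer_id, invoice_address: address }
  end
end

# Stand-in for a structured logger call: merges the event name with the
# object's hash and renders one JSON line.
def event_json(event_name, domain_object)
  JSON.generate({ event: event_name }.merge(domain_object.to_formatted_h))
end
```

For example, `event_json("invoice.created", Invoice.new(id: 42, customer_id: 7, address: "1 Main St"))` produces a single JSON line whose `invoice_id` and `customer_id` attributes can then be searched and grouped like any other.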


Then I can take that hash and log


it. Which means then you get all the


details about the invoice, the ID, the


customer, the address. And now you can


start to search your logs. Show me all


the issues that happened with a specific


customer, a specific invoice ID. Uh and


then in terms of the actual logic, I


mean you and I can have a chat later


about the the details of your question


because don't want to spend too much


time. Uh hi, thanks for your talk. I


have a follow-up question on what you


just said. Uh, when you log domain objects,


uh how do you deal with privacy with


sending data to a vendor? Yeah, great


question. There is a um a Rails feature


for this um called filter


attributes. It's not filter


attributes, it's parameter filter.


Thank you. Yeah. Um so uh you can new up


um a standard Rails class and pass these


attributes through that class and it'll


redact anything. You have some kind of


config where you can pass an array of uh


regular expressions or strings for


attributes that are private. So things


like email address, credit card numbers,


all this kind of stuff. And so, as


part of our


observability pipeline, everything we


send to our observability tool gets


passed through that call.


But it's not obvious how to do that and


nobody really talks about it. So again,


come and speak to me afterwards for the


details on that. But


um yeah, it's difficult because you


don't want to pepper all your code with


all these filter parameters everywhere.


So I try to do it in one place. Um and


that's just before it goes to the


observability


tool. I should also say depending on the


observability tool you use very often


there are redacting functions on their


side but ideally you don't really want


to be sending any PII over the wire.


Good. Uh so I have two questions. uh one


is that when you start to do


structured logging and then you have a


lot of attributes then in your


experience can you handle that? So in


the back end you don't specify a schema.


So you allow the application to actually


send what the developers want


to send as different attributes, but then


you allow them to aggregate and group


based on the type and


can you solve that without forcing some


kind of a schema on the back end or on


the front end? In general, in short, no.


So there's schema and then schemaless.


It's a great question. It's something


I've thought an awful lot about because


if you are logging these domain objects,


these domain objects by definition are


going to be different depending on the


different application. So you can't have


one set of standards to rule them


all. Now I am confident and hopeful that


OpenTelemetry will start to move into


this area of domain objects maybe in 10


years or something, when they've


tackled all the other really gnarly problems.


Schema.org is a way of marking up


web pages with standard attributes based


on kind of business objects. So I've


been looking at how I could use


schema.org


for observability, basically. But


schema.org is a little bit loosey goosey


with things. It's a little bit generic.


Uh in short, I haven't found a good


answer to uh the problem you're


proposing. However, um uh you may have


seen these tools like Segment and


PostHog and Google Analytics. All these


analytics tools, analytics overlaps with


observability a lot more than people


realize. Analytics is really just


observability and it's the same kind of


thing. So, uh for these events in our


codebase, I've started to introduce um


JSON schemas to validate the attributes


of domain objects. And that's actually


working really well. Uh it means that


anything that we're trying to log (well,


it's not logging, actually, it's


for the event side of things),


anything we try to send to our Segment


instance is blocked if it doesn't have


the correct attributes and we get an


error in Sentry and then we can fix it.
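A minimal sketch of gating events on a schema. It uses a tiny hand-written check that covers only the "required" and "type" keywords as a stand-in for a real JSON Schema validator library, and it treats nil as missing, which is stricter than the JSON Schema standard. The schema, publish_event, and InvalidEvent are hypothetical names.

```ruby
# Hypothetical schema, written in JSON Schema style.
INVOICE_CREATED_SCHEMA = {
  "type" => "object",
  "required" => %w[invoice_id customer_id total_cents],
  "properties" => {
    "invoice_id" => { "type" => "integer" },
    "customer_id" => { "type" => "integer" },
    "total_cents" => { "type" => "integer" }
  }
}.freeze

class InvalidEvent < StandardError; end

TYPE_CHECKS = {
  "integer" => ->(value) { value.is_a?(Integer) },
  "string" => ->(value) { value.is_a?(String) },
  "object" => ->(value) { value.is_a?(Hash) }
}.freeze

# Returns a list of human-readable problems; empty means the event is valid.
def schema_errors(schema, attributes)
  errors = []
  schema.fetch("required", []).each do |key|
    errors << "#{key} is missing" if attributes[key].nil?
  end
  schema.fetch("properties", {}).each do |key, rules|
    value = attributes[key]
    next if value.nil?

    check = TYPE_CHECKS.fetch(rules.fetch("type"))
    errors << "#{key} must be #{rules['type']}" unless check.call(value)
  end
  errors
end

# Blocks invalid events before they reach the sink (a stand-in for the
# analytics client) and raises so an error tracker can pick it up.
def publish_event(name, attributes, schema:, sink:)
  errors = schema_errors(schema, attributes)
  raise InvalidEvent, "#{name}: #{errors.join(', ')}" unless errors.empty?

  sink << { name: name, attributes: attributes }
end

sink = []
valid = { "invoice_id" => 42, "customer_id" => 7, "total_cents" => 12_500 }
invalid = { "invoice_id" => 43, "customer_id" => nil, "total_cents" => "12.50" }

publish_event("invoice.created", valid, schema: INVOICE_CREATED_SCHEMA, sink: sink)

begin
  publish_event("invoice.created", invalid, schema: INVOICE_CREATED_SCHEMA, sink: sink)
rescue InvalidEvent => error
  warn "blocked: #{error.message}"
end
```

Only the valid event reaches the sink; the invalid one surfaces as an exception instead of arriving downstream with nil attributes.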


So, no more nils. Um, but of course the


cost is everybody needs to agree on the


schema. You need to version the schemas.


Everybody needs to be in sync with that.


And how do you create that? I've had to


create, like, massive Notion documents


and train people on how to do that. So,


as we say, nothing comes for free,


right? But it's a really interesting


topic and area and I would love to get


more into that.


Thank you. That's my experience as well.


So uh and the other one is that you


showed that when you change you now have


alarms based on some criteria on the


logs. In your experience what is the


volume of logs that you can use to


create alarms compared to metrics?


Because metrics are basically like


optimization for that. So how long can


you go with logs for alarms?


How long can you go with logs, for


using logs to actually trigger


alarms? Have you experienced that you need


to go to metrics because it's too slow


to actually query all the logs at 5


minutes, or, I don't know, at 1 minute? So


what's the volume that you have


experienced?


I find a lot of the logging that


we do is very high volume. So we have uh


3 million members on our platform. So if


we're logging pretty much anything


that's in common use, we'll get hundreds


of thousands of logs. So I don't tend


to find that problem. Um I tend to find


that the logs rack up and then I can do


aggregations based on the last five


minutes, like the group and sort by


stuff that you saw there. And I set


some thresholds. What I will say is


despite showing you alerts in here, I'm


quite reticent to add alerts. Um, they


can be really noisy, and


sometimes noisy alerts are worse than no


alerts at all because then you start


ignoring them and then when there really


is a problem, it's the boy who cried


wolf syndrome. So alerts are kind of a


double-edged sword. I much prefer to


explore things and actually build the


tools to allow you to ad hoc explore


things and just use like alerts as kind


of a little bit of seasoning here and


there for really really critical things


like we're not getting any events


through to our Segment instance, for


example, or our logs are about to breach


their quota for the month, things


like that. But yeah, we can talk more


about the details later. Thank you.
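The five-minute aggregation with a threshold described in this answer would normally be a query inside the observability tool. The Ruby below only illustrates the logic, under assumed names (LogLine, noisy_groups) and an assumed threshold of three.

```ruby
# Hypothetical in-memory stand-in for structured log lines.
LogLine = Struct.new(:time, :event, :customer_id, keyword_init: true)

# Count log lines per event over the last `window_seconds` and keep only
# the groups at or above `threshold`, largest first.
def noisy_groups(logs, now:, window_seconds: 300, threshold: 3)
  window_start = now - window_seconds
  logs
    .select { |log| log.time >= window_start && log.time <= now }
    .group_by(&:event)
    .transform_values(&:size)
    .select { |_event, count| count >= threshold }
    .sort_by { |_event, count| -count }
    .to_h
end

now = Time.at(1_700_000_000)
logs = [
  LogLine.new(time: now - 30, event: "email.delivery_failed", customer_id: 1),
  LogLine.new(time: now - 90, event: "email.delivery_failed", customer_id: 2),
  LogLine.new(time: now - 200, event: "email.delivery_failed", customer_id: 3),
  LogLine.new(time: now - 60, event: "invoice.created", customer_id: 1),
  # Outside the five-minute window, so it is ignored.
  LogLine.new(time: now - 900, event: "email.delivery_failed", customer_id: 4)
]

alerts = noisy_groups(logs, now: now)
p alerts
```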


Okay, very aware of time here.


So, thank you for your talk. You


mentioned that traces are the best and


yet your talk is all about logs. So, I'm


curious, like, what are your thoughts


on traces and, for example, for


this problem, is it possible to use


traces instead of logs? Great


question. Thank you. Um I really


wrestled with this a lot when making uh


this presentation. I have a workshop


that I delivered in Kraków earlier


this year where we started out with


logging and then moved on to tracing. So


we saw the comparison with both of them.


I chose logging because it feels a bit


more pragmatic. Tracing feels more


magical to developers, and I


reasoned that there was a lot more to


explain. When I say to people structured


logging, they generally say, "Oh, what's


that? Yeah, that sounds interesting."


Whereas when I say tracing to people,


generally I get dead air, dead eyes. Um,


yeah. So, I do think tracing is the most


powerful.


Um, the reason is because traces capture


context. So in traces you have a


structured tree of events whereas logs


you have a series of events and there's


no nesting. You can't say show me all


the logs that happened in this specific


context. I mean you kind of can but it


gets super involved. Whereas


with traces you can query at


any level of the tree. So you can say


show me all the events that were nested


inside this other event essentially. So


it gives you that hierarchy. As I'm


demonstrating now, it's very difficult to


explain. Well, that was a terrible


explanation. So I do apologize. So


again, come up to me afterwards and we


can chat more about it in detail.
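The difference between a tree of events and a flat series can be sketched in a few lines. This is a toy model, not a real tracing API: the Span struct, the span helper, and the span names are all made up for illustration.

```ruby
# Toy model of a trace: each span has a name and nested child spans.
Span = Struct.new(:name, :children) do
  # Every span nested anywhere below this one.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

def span(name, *children)
  Span.new(name, children)
end

traces = [
  span("POST /password_resets",
       span("User.find_by_email"),
       span("PasswordResetMailer.deliver", span("smtp.send"))),
  span("POST /invoices",
       span("Invoice.create"),
       span("InvoiceMailer.deliver", span("smtp.send")))
]

# Trace-style query: smtp.send events nested inside a password reset request.
reset_sends = traces
  .select { |root| root.name == "POST /password_resets" }
  .flat_map(&:descendants)
  .select { |s| s.name == "smtp.send" }

# Log-style view: the same events as a flat series with no nesting, so both
# smtp.send lines look identical unless extra context is added to each line.
flat_log = traces.flat_map { |root| [root, *root.descendants] }.map(&:name)
flat_sends = flat_log.select { |name| name == "smtp.send" }

puts "nested under password reset: #{reset_sends.size}"
puts "in flat log: #{flat_sends.size}"
```

With the tree, the parent request narrows the query for free; with the flat series, each log line would need to carry something like a request ID to recover the same grouping.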


Yeah, this is all possible with


traces as well. Uh the other thing with


traces is generally speaking they're a


lot more expensive than


logs and they are a lot slower than logs


as well. Um again a lot of people in the


SRE community would take issue with


what I've said being so blanket. There's


times when logs are more expensive than


the traces, blah blah blah. But in


general, they're slower and more


expensive because the code has


to capture all of this context. And then


you run into issues around head


sampling, tail sampling. How do you keep


your tracing costs under control? And


that's an even bigger problem than


logging costs. So generally speaking, I


find logs are the sweet spot if you've


not done observability before. That's


like the gateway drug to traces. start


out with logs, add little bits of logs


here and there to answer your actual


questions that you have today and then


graduate to traces. But yeah, you and I


can talk more after and I can give you


the full rundown. Okay, John, the last


one. Yes, absolutely.


So just like a small follow-up question


to that because um so I realize that


this is about logging and tracing and


stuff is out but since you mentioned you


evolved several times through this


process, have you come to something


like the conclusion I've sort of


jumped to at some point, which was that it's


more of a question of instrumentation


versus appending, versus what


you call the appenders, and in fact you


could just instrument one thing that is


just this is an event that happens and


it's independent of if it ends up in a


log in a trace or a metric later on,


which is why it's kind of funny because


you see, like, you have a very


structured payload and then you still


have like that content message thing


which is like an interpolated string


which really wasn't necessary at that


point anyways, right? Yeah. So I'm


wondering is that is that sort of the


way you went forward where basically


your codebase is just instrumented with


events independent of what you actually


end up? That's exactly it. Exactly it.


And I really like Charity Majors' take


on this whole observability thing which


is at the root of everything. It's


events. It's events all the way down.


This is why event-sourced systems would


be an absolute dream to instrument and


to be observable because you just


listen to Kafka. You shove all of that in


your traces or logs and everything's


good. The only caveat to that is when it


comes to traces, as I mentioned before,


you do need the higher levels of


context. So when you're talking about


that, it's not quite as simple as just


events because you need some context to


pass into that event. It's still


very possible. Um that's the only slight


nuance to that that I would add but


absolutely and in fact um I started


making an architecture such that we


could have a value object, which is


essentially just a


basic value object to represent an event,


and then we have an event.publish


that puts it onto an ActiveSupport::Notifications


queue, and then


there's a listener that listens to that


that can send it to Datadog or Segment


or PostHog or any of these things, any


observability tool and any analytics


tool and that would be my dream if we


could pull that off man we'd be in such


a great place because you could just


have this stream of events and say I


want to fire these off to Datadog or


Dynatrace or any other observability


tool and we could have these other


events firing them off to analytics.


Again, that's a really big discussion,


but thanks for a great


question. Okay, thank you very much,


John Gallagher, ladies and


gentlemen. Thank you. Great talking.