Ingestion 3df2e268 extracted

Format: transcript
Kind: talk
External ID: 5. Maciej Rząsa - Debug like a scientist - wroc_love.rb 2024.txt
Content hash: 7098cfd769c9
Source at: 2024-03-22 09:00

Content

[Applause]


Hello folks, I'm Maciej. So far we've been talking about one half of our job: writing features. Now we'll be talking about writing bugs. And specifically, some bugs are just harder than the other ones, right?

Have you ever been there? You sit at work staring at the code, and it doesn't work, and you are stuck, because you've changed every code path twice and it still doesn't work. Or you debug as a team, and a user reports for the tenth time that it doesn't work, and nobody cares that it works on your computer. And then a guy with a hero complex arrives saying "I know", and he disappears for two days, returning with 1,000 lines of changed code. You deploy it to production with high hopes and, yeah, it doesn't work, right?

And then the worst happens, because people start venting. Everybody's frustrated, so you start looking for a scapegoat, maybe in your team, maybe outside. So the bug becomes a hot potato. If you are in a bigger organization, like I was, you basically think, "Okay, I can't fix it, so let's reassign it to the other team," and after a week this bug returns to you. Yeah, that happens. And at the end your manager comes asking, "What have you been doing for the last week?" "The bug." "So how many bugs have you fixed?" "None."

Okay, yeah, I think you have been there, right? Because my question should be not "have you ever been there" but "how many times have you been there", because I heard it's a conference for senior developers. I have always highly regarded this conference because of this. So, how many times have you been there? I have been there numerous times, because I've been in Ruby development for over 10 years, and right now I work on an application that has terabytes of data in the database. We orchestrate a dozen machine learning services. It's awesome, but it's also complex, so it's very easy to get stuck in some weird bugs. And that's just the last year. What about the previous 10 years?

Yeah, I have been there, and I must say that when you are there, you start questioning your life choices. Was anybody studying here, in this room? Show of hands. Yeah, this is the moment when you think, "Maybe I should go to the other side of this building and become a journalist." Or, when I get stuck on this kind of bug, I think, "Yeah, I can still work remotely, but for me working remotely should mean working in remote parts of the Polish mountains as a shepherd, not just working from my desk." Yeah, you've been there. I see it in your eyes.

But I haven't switched my job. I'm still a developer. I used to be a principal, I used to be a, how is it called, a team lead; now I'm just a senior developer, but the company is awesome, so I stay there.


And I bring you hope. I want to share with you a method that is very effective for me, a way out of debugging hell. I'm here mostly because I've seen it applied by other developers, independently of me. So I noticed: I use this highly effective method, and they use it too, even though I didn't tell them, so it means it's good, right? I want to tell you to start debugging like a scientist.

I work for a company called Chattermill. This is the moment of introduction: 10 years of experience, I told you; Ruby development, I told you. Let's move on. I'm not a scientist, so why should I tell you to debug like a scientist? And moreover (sorry David, now a bit of grilling), why should we debug like scientists? They are weird, right? They are highly impractical. They don't know anything about bugs, because what do they do? They coin some weird theories, they write boring papers nobody wants to read, and they spend public money that could have been spent on, I don't know, trains or highways or kindergartens. So why should we be like scientists? Nobody knows, so I'll tell you.

Let's get back to the end of the 19th century, when physics was in a kind of split-brain state. On one hand, physicists thought they got it, they understood the world. It was a nice feeling. On the other hand, they were in a state of a small crisis regarding the speed of light. They knew a lot about the nature of light. They knew it's both a wave and a particle, and they were able to measure the speed of light, which is wonderful.


Still, they thought that light, to propagate, needs some kind of medium, just like sound needs a medium to propagate. So they proposed the hypothesis of the luminiferous aether: an unmoving medium that light uses to propagate, and it's all around in space, because we can see the stars, right? So they thought it would be nice to measure the speed of Earth relative to the aether. So Michelson and Morley in 1887 proposed this very smart experiment, in which they sent two rays of light in orthogonal directions and then wanted to measure the speed difference, because, based on classical physics, on the classical Galilean transformations, these speeds should be different. They measured it, and the speed was the same. It was a surprising behavior of reality. This kind of surprising behavior of reality, in computer science, we call a bug.

Yeah, and surely they were bugged. So what did they do? The same as we do: they patched it. They proposed some ad hoc hypotheses that were rather weak and didn't work, and this crisis went on until a young clerk from a patent office arrived, writing two papers, two boring papers that nobody reads. One important thing is that Einstein decided to describe reality instead of yelling at reality. He said: okay, from the observations and from the theory I can say two things for sure, two assumptions: that the laws of physics are the same in every frame of reference, and that the speed of light is independent of the state of motion. He started with those two assumptions, and then he was able to deduce what we now call length contraction and time dilation, and he was able to deduce the Lorentz transformation that you can see at the bottom. It was known already, but it wasn't well grounded.

And yeah, it was just a hypothesis, a weird theory by some young guy that nobody knew. But then physicists started to make experiments, and the experiments confirmed this theory. That's something that we call special relativity. At least in my time it was something normal to learn in high school. You know, it's not rocket science, it's high-school physics. But still.

So what happened here? The results are less important; the method is important. Physicists saw a weird state of reality that was surprising, so they guessed how to explain it: they formulated a hypothesis. Then they proposed some experiments, they made predictions, and they made experiments to confirm or reject it. The first hypothesis, the luminiferous aether, they had to reject, but the second hypothesis I told you about, special relativity, they kept: they verified it. That's how physics goes forward, and that's something I believe we can apply. But still, you can say, "It was a cool story, bro, let's go have lunch."

But that was the simpler part, because I have no emotional connection to Mr. Einstein or to the crisis in physics, but I do have emotional connections with the war stories I'll be telling, and if somebody starts shaking: there are some survivors in this room, I believe. So let me tell you about the first one. We had weird errors in our integration testing suite, in our Cucumber suite. Who loves Cucumber? Show of hands. Yeah, it's stable, right? It never breaks.

We were optimizing a GraphQL API, and we introduced a library to preload data from the database, and this library started causing errors. Those errors were rather shameful, because we were calling a method that didn't exist. So it was serious, and in front of our door we had an angry mob with pitchforks yelling, "Remove the gem! Remove the gem! Remove the gem!" Fortunately I was working remotely, so it was a virtual mob with virtual pitchforks, but the yelling was real. We could have removed the gem, but we didn't want to, because it sped things up and we were able to deliver on time, so we really needed this gem.

So we started investigating. We started reading the source code of the gem, and we noticed that everything should be all right, "it works on my computer", right? Because we have this special configuration constant that is used in the guard statement. But the bug still happens, right? So we could start yelling at reality, saying, "It's impossible, it doesn't happen." Yeah, but unfortunately it did happen, and the pitchforks were virtual but dangerous. So we dug deeper, and we realized that indeed we changed this configuration variable in a before filter, but then we changed it back, so everything should be good.


And that was the moment when I started staring at the monitor thinking, "I know it. I have seen something like this." And I thought, yeah, I guess it might be a race condition. It's not that I've seen it too many times as a Ruby developer, but it looks like a race condition. It's my wild guess; I need to verify it. Now the stakes were high, because I was promised a PhD. I'm still waiting for it. But, you know, I lied a bit. I wasn't guessing, because I'm a software engineer, I'm a professional. It wasn't a guess: I formulated a hypothesis, and for this hypothesis I had some supporting evidence.

The first piece of supporting evidence was that this before filter changing the configuration was introduced around the time when the CI problems started to happen. That was good. The second was that we were using Puma for integration tests, and we had threading enabled in Puma, so yeah, it was possible that it was a threading problem. So I tried to experiment, and the first experiment, with threads, failed; it wasn't simple enough. So I decided to do something simpler, and I simulated threads. I wrote a script. The first part was the business logic, those three lines, the business logic that could raise an error. Then I interleaved it with the before filter and after filter that could run in another thread, but I interleaved it manually, line by line, in the same script. I ran it and, wow, we failed successfully.

We confirmed that it is possible that it is because of the race condition, so we had a good reason to revert the pull request that introduced this race condition, and the problem was removed. It was introduced by another team, fortunately and unfortunately, because in the normal case we would go to this team and say, "Yeah, it's your fault, try to fix it," and they would say, "No, it's your gem, try to fix it." And here we had very strong evidence saying, "The gem is okay, guys; it's the change that you made that's not 100% robust. Please fix it." And they did, and everything was good. It wasn't a hot potato; it was a well-described bug that can be fixed.

So what happened here is that we started with some observations about the guard variable and about the place of the error. Then there was a wild hypothesis that it's a race condition, which was verified with a very simple experiment that allowed us to understand reality, to describe it correctly, and to propose a fix. You know, very often the bug is fixed not when you push something to production, but when you understand what happens.
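
The manual interleaving experiment from the race-condition story can be sketched roughly as follows. This is a minimal illustration, not the script from the talk: `PreloadConfig`, `PlainRecord`, `PreloadedRecord`, `load_record`, and `render` are hypothetical names, and the guard is assumed to be a simple global boolean that a before filter switches off and an after filter switches back on.

```ruby
# Hypothetical global switch, standing in for the gem's configuration constant.
module PreloadConfig
  class << self
    attr_accessor :enabled
  end
  self.enabled = true
end

PlainRecord     = Struct.new(:id)
PreloadedRecord = Struct.new(:id) do
  def preloaded_data
    "data for #{id}"
  end
end

# Step 1 of the business logic: what gets built depends on the flag.
def load_record(id)
  PreloadConfig.enabled ? PreloadedRecord.new(id) : PlainRecord.new(id)
end

# Step 2: the guard reads the flag again before calling the method.
def render(record)
  PreloadConfig.enabled ? record.preloaded_data : "plain #{record.id}"
end

# One possible schedule of two requests, written out line by line
# in a single script instead of relying on real threads.
def interleaved_run
  PreloadConfig.enabled = false   # request A: before filter
  record = load_record(1)         # request B: step 1 sees the flag off
  PreloadConfig.enabled = true    # request A: after filter restores it
  render(record)                  # request B: step 2 sees the flag on
rescue NoMethodError => e
  "reproduced: #{e.class}"
ensure
  PreloadConfig.enabled = true    # leave the global state clean
end

puts interleaved_run
```

Writing the schedule out by hand makes the failure deterministic, which is why such an experiment can be simpler to run than one with real threads.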


Right? And what also happened is that we started with gathering data, with observation, a bit in the lab and a bit in the library. We proposed a hypothesis, we verified it, and then we got to the understanding. It looks a bit like the scientific method. I proposed the name "hypothesis-driven development", because it looks good, and you can write a book with "something-driven something", right?


And yeah, it worked, but it was a simple case, right? I was able to debug locally. But sometimes you can't debug locally; sometimes the code fails only on CI. So what then? Yeah, you should have a reproducible CI environment locally, right? Everybody does. This time we weren't able to reproduce it locally, but only this time, right? And again it was about Cucumber. We had a distributed environment, several services, everything orchestrated with Docker Compose, and the setup of the Docker Compose environment was failing from time to time with some ugly errors about timeouts, or errors that suggested that something in the infrastructure doesn't work.

So I started digging, because we needed this to work badly. And I thought, yeah, it looks like a timeout, so maybe it's when we push the data to the GraphQL gateway; maybe we push too much. That was my hypothesis. So let's try to make a lighter request: instead of doing a huge POST, let's try to do a simple HEAD and see what happens, and puts the results, because I'm a puts debugger. There was a retry, and at first it was a 404, but on the second try it was a read timeout. I thought, "Oh, read timeout, it's different." I put it in my notes: it's different, it's weird. What is a read timeout? What's the difference between the various timeouts? I don't know, so let's read around, let's get back to the library.

I found a piece of documentation explaining the read timeout and that it's different from the open timeout. And I thought: okay, so it's a read timeout; it means that the container is started, the process is probably started, because the port is open, but the process doesn't react to my request. That's weird. Let's put it in my notes. I don't understand it, but let's dig around. So I was Googling another day, and I found in the Compose issues that somebody had an issue where a container froze, and it was because of excessive logging and some weird configuration state in Docker Compose. I thought, yeah, it looks similar, but I don't know. Yeah, but we log a lot: we have a health check that logs like crazy.
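
The open-versus-read timeout distinction from the notes above can be sketched with Ruby's standard `Net::HTTP`. The talk does not say which HTTP client was used, so the client choice, the `probe` helper, and the timeout values here are assumptions made for illustration.

```ruby
require "net/http"

# Classifies what a lightweight HEAD request reveals about a service.
# `probe` is a hypothetical helper, not code from the talk.
def probe(host, port, path: "/", open_timeout: 1, read_timeout: 1)
  http = Net::HTTP.new(host, port, nil) # nil: ignore any proxy set in the environment
  http.open_timeout = open_timeout      # time allowed to establish the TCP connection
  http.read_timeout = read_timeout      # time allowed to wait for response data
  http.max_retries  = 0                 # surface the first failure instead of retrying
  response = http.head(path)
  "http_#{response.code}".to_sym
rescue Net::OpenTimeout
  :open_timeout   # the connection could not be established in time
rescue Net::ReadTimeout
  :read_timeout   # the port is open, but the process never answered
rescue Errno::ECONNREFUSED
  :refused        # nothing is listening on the port
end
```

A `:read_timeout` result matches the observation in the story: the connection succeeds, so something holds the port open, yet no response arrives.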


So let's make a hypothesis that it's because of logging. It's weird, but maybe. And the simplest possible experiment: let's disable the logging for the health check and see what happens. And it was fixed. So again, it gave us the understanding: it's because of Docker Compose. It's very easy to blame a dependency, right? It's the next level of hot potato: it's not us, it's the infra. But this time we knew very well it's not generic infra; it's because we have a too-old Docker Compose that has this weird error. "Infra guys, please, please, please upgrade it for us." We had a very good way of saying this. And you can see we started with some random observations. I was taking notes on everything, and I understood very little. I connected the dots, saying it might be a bug in Docker Compose. I made a simple experiment; I didn't break anything, but I guess I had to push it to master. It gave us the understanding, and we were more than halfway there when we reached this point.

But still, it was a simple case: it was one team debugging, and it wasn't very urgent, just urgent enough. But there was another case, the last one I'm going to tell you about. It was disgusting, it was shameful, it was embarrassing, because production was failing at a regular cadence, every 30 minutes, and we had no idea why. We saw those 502s, our business saw 502s, our clients saw 502s, and we didn't know why.


I joined the working group trying to fix it, collected from various teams, or rather a task force. I believe some veterans of this task force may be in this room; give them some comfort. We were trying to understand what was happening, and we had a lot of hypotheses. Maybe we have an application-level cron, a Clockwork, that every 30 minutes does something. Or maybe we have an infrastructure-level cron that does something low-level, like, I don't know, log harvesting, that kills our disks or our network. Or maybe we have a client that sends slow requests every 10 or 30 minutes, or another service in our environment sends this slow request or a bunch of slow requests. We had a lot of hypotheses, and the work was going on in parallel. That was the moment when I understood that the method that was very convenient for me, working with hypotheses and experiments in a repeated manner, was used by my colleagues, who were definitely smarter than me, and I thought, yeah, that's a good method, I should tell somebody about it. So we were working like this, with little progress. Days passed, we were more and more embarrassed, and there were no simple solutions.
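
One lightweight way to keep such parallel hypotheses visible to a whole task force is a shared list with an explicit status per entry. The sketch below is only an illustration: the `Hypothesis` structure, its fields, the owners, and the recorded observation are all invented for the example, since the talk does not describe any tooling or say how each hypothesis was resolved.

```ruby
# A hypothetical record for one line of inquiry.
Hypothesis = Struct.new(:claim, :owner, :evidence, :status, keyword_init: true) do
  # Appends an observation; an optional verdict closes the hypothesis.
  def record(observation, verdict = nil)
    evidence << observation
    self.status = verdict if verdict # :confirmed or :rejected once an experiment decides
    self
  end
end

board = [
  Hypothesis.new(claim: "application-level cron runs every 30 minutes",
                 owner: "dev A", evidence: [], status: :open),
  Hypothesis.new(claim: "infrastructure-level cron saturates disks or network",
                 owner: "infra B", evidence: [], status: :open),
  Hypothesis.new(claim: "a client sends slow requests periodically",
                 owner: "dev C", evidence: [], status: :open)
]

# Made-up observation, for illustration only.
board[0].record("no scheduled job matches the failure times", :rejected)

open_claims = board.select { |h| h.status == :open }.map(&:claim)
puts open_claims
```

Keeping rejected entries on the board, with their evidence, records which branches have already been cut, so nobody repeats the same check.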


But we had some observations. First of all, somebody noticed that it's just a single node at a time. It's not that the whole application stops; the error is raised by just a single node. So we had a hypothesis that it's some kind of state building up. It's rather obvious, right? Some people started checking the database and the load balancer. My hypothesis was that it must be a memory leak, because a couple of months back I was debugging a memory leak and it stayed with me, and yeah, there was a memory peak around the time of the errors. And I started taking notes, trying to correlate what we had.

One important detail is that we were seeing the problems in Grafana, because we had a metric of the length of the Passenger queue, and when the request queue was growing like crazy, like you can see here, it meant that we had problems. That was the symptom. So I was taking notes: okay, when exactly does the problem start? It starts at 10:20, then the queue saturates in about 4 minutes, and then it drops, and after half an hour it happens again. Okay, I understand a bit. So I correlated it, and I noticed that the first timeout happens before the state buildup in the Passenger queue. That's curious: so the queue is just a symptom. And the memory peak is not before the buildup but after the buildup, so it's not a cause, it's an effect.

I sat down to work on it late in the evening. It's a bad practice, never do this, please, please, please. But I did it, because I kept it in my head. I added some more stats to Grafana, because we weren't showing everything that we had from Passenger, and I noticed a curious thing. It's probably too small: the yellow chart is the number of processes in Passenger, which is decreasing, and below it are spawn events, when Passenger spawns a new process. So I was staring at this, thinking: so the number of processes drops. Why? Yeah, Passenger kills processes, but it should spawn new processes, right? Maybe they are not killed in the right way.


Then why do we spawn new processes so rarely? I don't know. So what I did: I put my notes on Slack and I went back to bed. Those were questions rather than answers. I put down my observations and I said: I don't understand why a new process is not started, I don't understand why the buildup happens at a certain threshold, but I do understand that I was wrong: it's not a memory leak. And when I returned to work the next day I saw some answers, because a guy from Argentina was working later, picked up what I left, and gave some answers. He noticed that there are some error logs in Passenger saying that Passenger gets a timeout on start and that it enters deployment error resistance mode. That's a special feature of Passenger: if something bad happens on startup, new processes are not started. And that was the explanation of why the problem happens.

So we started gathering what we had. It's just in one instance; it's a localized problem. We have the Passenger resistance mode; we understood it, and we understood very well how the Passenger process model works, which I didn't know before. We also noticed that when a new node was started, the problem didn't happen there, so it was really a local problem. And somebody noticed that we had a huge Bootsnap cache. Do you know Bootsnap? It's a Shopify gem that helps you start your Rails processes faster, because it caches a lot. But when this cache reached, I don't know, 10 gigs or 16 gigs, reading the cache was so slow that Passenger was timing out on startup. At least that was our hypothesis. So what was the simplest way of testing it? We removed the Bootsnap cache.


And we held our breath, because the problem stopped appearing, but it meant nothing: maybe it reappears in the next half an hour, right? But it didn't, and after a day we got to the understanding that it was the Bootsnap cache that was the problem. So how come, right? It's a good gem. For the history we need to go back half a year, to when my team upgraded the Bootsnap gem. Maybe I was even reviewing this, but it was, yeah, a number change, a small upgrade, it's okay. But with the small upgrade came a change in config, and Bootsnap started keeping the cache in a different directory, and we didn't change our deployment script. So Bootsnap was adding stuff to the cache, but we never cleared it, and after half a year it got to this 10 or 16 gigs and it started kicking us, really. So all the classical methods, "revert the last deployment" or "check the code changes for the last week", didn't have a chance.
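
A deploy-time check along these lines could flag a cache that only ever grows. This is a sketch under assumptions: the cache path is supplied by the caller (the talk does not name the directory), and the `cache_size_bytes` and `cache_too_large?` helpers and any threshold are hypothetical.

```ruby
require "find"

# Sums the sizes of all regular files under +dir+ (0 if it does not exist).
def cache_size_bytes(dir)
  return 0 unless File.directory?(dir)

  total = 0
  Find.find(dir) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end

# True when the cache has grown past the limit and should be cleared
# (or at least reported) before the app server starts new processes.
def cache_too_large?(dir, limit_bytes)
  cache_size_bytes(dir) > limit_bytes
end
```

A deployment script might call `cache_too_large?(cache_dir, 1024**3)` and warn or clear the directory, where `cache_dir` and the one-gigabyte limit are placeholders to adapt to the actual setup.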


Right? It was a stupid error, but there is one more important thing: it showed me how to scale the debugging effort. It's easy to debug as a single person, but what about the whole team? We were able to follow multiple paths of inquiry without blocking each other, because we had multiple hypotheses and we tested many of them in parallel. Also, we were working in small teams, or alone, on a single hypothesis, because very often one person had an idea, "it might be something with the database or the load balancer"; a developer could say this, but he needed an infra person to really work with the load balancer and verify it. We were also able to work around the clock, not because we were tied to our keyboards, but because we were spread geographically and the next time zone was able to pick up what we had prepared. And it was possible because we didn't have this hero problem: we were publishing every small piece of evidence, every small observation, every small hypothesis, so that we were standing on each other's shoulders. What was required was a kind of safe space, where we knew that it's about fixing the problem, not about being a hero, and that we work as a team and will be rewarded in one way or another.

And also something that might be important for you; if you want to pay attention for just 30 seconds, this is the moment. It's a great way to work with junior team members. Instead of saying, "You are too dumb to fix it, let me debug it," you say: "Okay, so you're debugging this and you seem to be stuck, right?" "Yeah, right." "So what's your hypothesis? Okay, you don't have a hypothesis, but what's the weird thing that you observe? Let's write this down. Let's write down what the internet says about this behavior. Okay, so what's your hypothesis now? How do we connect the dots? And then, how do we verify it? Okay, but you need two days for this, right? So let's find a simpler way; maybe let's verify it in the next two hours." And then: "Okay, it's not this," or, "Yeah, it's this; we understand it better, so maybe we can fix it." You can use the Socratic method working with your junior team members or with your peers to help them grow, instead of being the guy that says, "Okay, I'm the hero, I'll fix it." That's the better way of scaling your effort.

And do you remember where we started, with that frustration, being stuck with debugging and thinking about working remotely? There is a way out, and this way out lies in science. But it's not about scientific results; it's about looking at the way scientists work and trying to apply this method of working to our daily practice. And it's more than just proposing hypotheses; it's also about good practices, starting with the mindset. So when you arrive at the problem, don't jump to a solution. Don't be a guy, or a girl, with a pitchfork yelling, "Remove the gem! Remove the gem! Revert the last commit!" Because without evidence it's just your opinion. It doesn't matter if the person saying this is an architect, a manager, or a CEO; it's just his opinion or her opinion, and it's not really actionable. It might be a good way of proposing a hypothesis to experiment with, but not something really actionable.

Then it's about habits. The crucial habit for me during debugging is taking notes of everything and, when I don't understand something, asking questions in the same notes. It's a local way of rubber duck debugging, and it's also a great way of staying focused even during interruptions. For sure you've read hundreds of blog posts saying you cannot interrupt a developer at work, because he will be removed from the state of flow and it takes half an hour to get back, right? Rubbish. It's rubbish, folks.


We can get back to the state of flow, we can rebuild the mental state, if we are better at taking notes. With notes we are far faster to get back to work, and interruptions are our daily life, so let's accept them instead of yelling at them, right?

Next thing: there are two types of developers. One can spend a month in the library just to avoid doing anything in the workshop, in the lab, and the other one tries to fix the code for a month instead of reading how it works for one day, right? Be neither of them. Balance this, because those two practices really help each other to make debugging faster.

And finally, communication. To learn effectively, you need to say, "I think it's this." I might be wrong, but if you say it out loud, if you write it in your notes, if you say it on Slack, you are attached to it, and you understand better why you failed, and you know better what you checked. Then, it's a scientific approach that you publish everything that you know; otherwise you're not a good scientist, you disappear. And it's the same here: if you know something, publish it. Because in the ideal world, of course not in the real world, science should work like this: somebody has a theory and publishes a paper, then someone else wants to check it, so proposes an experiment and verifies it, and somebody else reproduces this. They all publish papers, and in this asynchronous and decentralized manner science is pushed forward. That's the way to take debugging as a collective effort. And also, to make it work, scientists shouldn't be rewarded for the volume of their publications. They shouldn't. And it's the same in debugging: it doesn't matter how much you type on Slack; it matters whether your observations and your hypotheses push the effort forward. Maybe you were wrong, but at least you cut some branches.

And there is one last thing that I want you to remember. Debugging was a very hard lesson, and the way to make sure that you understand well what you learned is to transfer the knowledge: to tell it to someone else, to write a note on Slack, to tell it at a local meetup, to tell it over a beer to a friend. Why? Because that's the best way to make sure you remember it, and you are faster to fix it the next time.


You can see it. Thank you. Now I'll answer any questions.

[Applause]

Thanks, that was great. I could see myself in many of those situations, so that was good. I wonder, because this is more of a formalization on top of something that we end up doing in a very natural way, whether you have anything to share about practices, more process things, that happen in your teams related to that. Like, for every situation like that there is a postmortem that people need to share, or there is documentation that is taken around that, or for some kinds of bugs the PR needs to try to reproduce the scenario; things like that, introduced as part of the culture of the team as a way to help avoid those situations in the future. Which kinds of process measures have worked for the team that you work on, that you could suggest for other teams as well?

So the question is more about the processes on the company level that a manager could introduce and that could be helpful. Of course I do believe that postmortems are helpful, and in the company that I worked for before, it was a big company, we did have the rule that we write postmortems. Still, this rule is not enough, because you can write a really blameless but honest postmortem that reads like a crime novel and everybody reads it, or you can be in a culture where your biggest concern during a postmortem is to put the blame on somebody else, and not because you are a bad person but because the company is so bad. So yeah, postmortems are a good idea, but you need a lot of cultural work around them to make sure that they are helpful. In a small company I don't think you need a formal postmortem, but, as I said, an internal rule: "Yeah, we screwed up, let's write it on Slack." I think it's enough.


Would it help us to avoid problems in the future? For the same problems, probably yes, but then the set of problems is infinite. The thing that surprised me was that in one instance I was looking at a postmortem: there was the description of the problem, then the actions taken, and then I was waiting for a long-term fix, and there was none. When I yelled around that there should be a long-term fix, the manager, he was a smart guy, said that introducing a long-term fix would be too heavy for a problem that will probably not happen too often. And it was a good call: instead of making the process very heavy to cover every corner case, we say, okay, we go fast, sometimes we break things.


Did you consider writing a paper about this?

I would love to. You need to advise me where to publish it.

Okay, so if there are no more questions: folks, PhD of bugs, Maciej Rząsa.