
Fix Production Bugs 20 Times Faster


Callaghan's wroclove.rb 2025 talk, subtitled 'The Power of Structured Logging'. Opens with a Friday-afternoon on-call story at BiggerPockets where password reset emails silently fail to send, exposing a blind spot in logs, Sidekiq dashboards, and Sentry. Surveys 50 engineers (half hit invisible issues monthly, a third weekly) and surfaces the emotional toll (annoyance, boredom, pressure, self-doubt).

Introduces the five-step 'Steps to Observable Software' (SOS) cycle: Question → Data → Instrumentation → Graphs → Improve. Walks the password-reset example, formulating hypotheses (failing / timing out / delayed jobs), refining the question to 'what jobs take time in the within-five-minutes queue', and defining the required data as event × filter × group × value (expressible as SQL).

Argues 'three pillars of observability' is a misnomer: traces, logs and metrics are data types, with traces best and metrics worst. Picks logs as a pragmatic gateway drug. Compares plain logs (unsearchable strings, needing regex) with structured logs (attributes as columns that can be filtered, grouped, and sorted). Evaluates Ruby logging libraries against criteria (structured payloads, Rails integration, docs, maturity) and recommends Semantic Logger, used in BiggerPockets production for three years.

Demonstrates installing rails_semantic_logger with JSON format and active_job autologging, then addresses three weaknesses: (1) no conventions → adopt OpenTelemetry Semantic Conventions (e.g. messaging.destination.name for queues, since OTel models background jobs as messaging alongside Kafka/RabbitMQ), swapping the semantic_logger subscriber and event formatter to emit OTel names; (2) missing attributes → use config.log_tags (lambda and method-shortcut syntaxes) in application.rb to tag every request with HTTP headers and user agent; (3) API requests missing → register Faraday middleware that logs outbound HTTP calls (URL, duration, etc.).
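The structured-log idea and the event × filter × group × value query can be sketched with the Ruby standard library alone. This is an illustrative stand-in, not the semantic_logger API and not code from the talk: only `messaging.destination.name` and the `AnalyticsUpdateUserVisitsJob` class name come from the summary above, while the `event`, `job.class`, and `duration_ms` attribute names, the `within_five_minutes` queue spelling, `PasswordResetEmailJob`, and all sample values are assumptions made up for the example.

```ruby
require "json"
require "logger"
require "stringio"
require "time"

# Stdlib Logger that writes one JSON object per line. A stand-in for a
# structured-logging library; this is NOT the semantic_logger API.
def build_json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    # Hash messages become attributes; anything else becomes a "message" field.
    attributes = msg.is_a?(Hash) ? msg : { "message" => msg.to_s }
    base = { "level" => severity, "time" => time.utc.iso8601(3) }
    JSON.generate(base.merge(attributes)) + "\n"
  end
  logger
end

# Data = event x filter x group x value, roughly:
#   SELECT <group>, SUM(<value>) FROM logs
#   WHERE event = <event> AND <filter> GROUP BY <group>
def summarize(log_lines, event:, filter:, group:, value:)
  log_lines
    .map { |line| JSON.parse(line) }
    .select { |e| e["event"] == event && filter.all? { |k, v| e[k] == v } }
    .group_by { |e| e[group] }
    .transform_values { |events| events.sum { |e| e[value] } }
end

io = StringIO.new
logger = build_json_logger(io)

# Made-up sample events; "job.class" and "duration_ms" are hypothetical names.
[
  ["within_five_minutes", "AnalyticsUpdateUserVisitsJob", 900],
  ["within_five_minutes", "AnalyticsUpdateUserVisitsJob", 1100],
  ["within_five_minutes", "PasswordResetEmailJob", 40],
  ["default", "AnalyticsUpdateUserVisitsJob", 500]
].each do |queue, job_class, duration_ms|
  logger.info({
    "event" => "job.performed",
    "messaging.destination.name" => queue,
    "job.class" => job_class,
    "duration_ms" => duration_ms
  })
end

# "What jobs take time in the within-five-minutes queue?"
totals = summarize(
  io.string.lines,
  event: "job.performed",
  filter: { "messaging.destination.name" => "within_five_minutes" },
  group: "job.class",
  value: "duration_ms"
)
puts totals.inspect
```

With this sample data the grouped totals come to 2000 ms for `AnalyticsUpdateUserVisitsJob` and 40 ms for `PasswordResetEmailJob`. Because every attribute is a named field, the query needs no regular expressions, which is the contrast with plain string logs drawn above.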
Shows shipping logs to Dynatrace via a semantic_logger HTTP appender that batches and POSTs.

Revisits the incident with observability in place: a monitoring alert fires on Friday morning when the within-five-minutes queue latency breaches its 15-minute SLA; grouping jobs by class points to AnalyticsUpdateUserVisitsJob; grouping enqueues by HTTP resource points to ProfilesController#show; grouping requests by IP reveals a scraper; blocking the IP in infrastructure restores the graph in minutes. A second anecdote shows catching another scraper that was guessing usernames by spotting a 404 spike, blocking it, and reducing request time by 7.1%.

Closes with results (98% downtime reduction, 83% fewer 500s, 20× faster bug fixes) and a QR code to articles. Q&A covers cost control, domain-object logging, PII redaction, schemas, logs-vs-metrics for alerts, traces vs logs, and the events-all-the-way-down instrumentation architecture.
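The incident drill-down (alert on queue latency, then group by job class, HTTP resource, and IP) could look roughly like the following plain-Ruby sketch. It is an assumption-laden illustration rather than the talk's actual queries: the attribute names `queue_latency_s`, `job.class`, `http.resource`, and `client.ip`, the `within_five_minutes` queue spelling, and every sample value are hypothetical, and the IP addresses are taken from a documentation-only range.

```ruby
SLA_SECONDS = 15 * 60 # the 15-minute SLA from the incident story

# True when any job on the given queue waited longer than the SLA.
def sla_breached?(job_events, queue:)
  job_events.any? do |e|
    e["messaging.destination.name"] == queue && e["queue_latency_s"] > SLA_SECONDS
  end
end

# Returns [value, count] for the most frequent value of `key`, or nil if empty.
def top_by(events, key)
  events.map { |e| e[key] }.tally.max_by { |_value, count| count }
end

# Made-up structured events, one array per event type.
jobs = [
  { "messaging.destination.name" => "within_five_minutes",
    "job.class" => "AnalyticsUpdateUserVisitsJob", "queue_latency_s" => 1200 },
  { "messaging.destination.name" => "within_five_minutes",
    "job.class" => "AnalyticsUpdateUserVisitsJob", "queue_latency_s" => 1150 },
  { "messaging.destination.name" => "within_five_minutes",
    "job.class" => "PasswordResetEmailJob", "queue_latency_s" => 1180 }
]
enqueues = [
  { "job.class" => "AnalyticsUpdateUserVisitsJob", "http.resource" => "ProfilesController#show" },
  { "job.class" => "AnalyticsUpdateUserVisitsJob", "http.resource" => "ProfilesController#show" },
  { "job.class" => "AnalyticsUpdateUserVisitsJob", "http.resource" => "HomeController#index" }
]
requests = [
  { "http.resource" => "ProfilesController#show", "client.ip" => "203.0.113.7" },
  { "http.resource" => "ProfilesController#show", "client.ip" => "203.0.113.7" },
  { "http.resource" => "ProfilesController#show", "client.ip" => "198.51.100.20" }
]

if sla_breached?(jobs, queue: "within_five_minutes")
  # Each step narrows the question: which job, enqueued from where, by whom?
  puts "busiest job class:  #{top_by(jobs, 'job.class').inspect}"
  puts "busiest enqueuer:   #{top_by(enqueues, 'http.resource').inspect}"
  puts "busiest client IP:  #{top_by(requests, 'client.ip').inspect}"
end
```

Each `top_by` call corresponds to one "group by" step in the story. In a real setup these would be queries in the log platform, not in-process Ruby.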

type
talk
subtitle
The Power of Structured Logging
talk Fix Production Bugs 20 Times Faster
about
Talk is subtitled 'The Power of Structured Logging' and makes the case for structured over plain logs.
talk Fix Production Bugs 20 Times Faster
about
Introduces and walks through the SOS five-step cycle.
talk Fix Production Bugs 20 Times Faster
about
Recommends and demonstrates semantic_logger as the structured-logging library of choice.
talk Fix Production Bugs 20 Times Faster
about
Adopts OTel semantic conventions for attribute names in structured logs.
talk Fix Production Bugs 20 Times Faster
about
Discusses OTel as a standards body and the maturity status of its Ruby library.
talk Fix Production Bugs 20 Times Faster
about
Sidekiq tool
Password-reset background jobs run on Sidekiq; the dashboard was insufficient for diagnosing queue delays.
talk Fix Production Bugs 20 Times Faster
about
Faraday tool
Outbound API logging middleware is demonstrated on Faraday.
talk Fix Production Bugs 20 Times Faster
about
Dynatrace company
Demonstrates shipping structured logs to Dynatrace via an HTTP appender.
talk Fix Production Bugs 20 Times Faster
about
Entire talk is scoped to a Rails production application.
talk Fix Production Bugs 20 Times Faster
about
Argues against the 'three pillars' framing and ranks the three data types.
talk Fix Production Bugs 20 Times Faster
about
Rails 8.1 tool
Mentions upcoming structured-logging improvements in Rails 8.1.
talk Fix Production Bugs 20 Times Faster
about
Sentry tool
Team used Sentry to confirm that jobs weren't failing.
asked_at
Fix Production Bugs 20 Times Faster talk
First Q&A question covering both bills and business-logic logging.
asked_at
Fix Production Bugs 20 Times Faster talk
Follow-up on logging domain objects responsibly.
asked_at
Fix Production Bugs 20 Times Faster talk
Audience question on schema vs schemaless tradeoffs.
asked_at
Fix Production Bugs 20 Times Faster talk
Audience asked at what log volume metrics become necessary for alerts.
asked_at
Fix Production Bugs 20 Times Faster talk
Audience challenged Callaghan's framing that traces are superior yet the talk focused on logs.
asked_at
Fix Production Bugs 20 Times Faster talk
Final Q&A question framing instrumentation as events with pluggable sinks.
authored
Fix Production Bugs 20 Times Faster talk
Callaghan is the sole speaker for this talk.
from_talk
Fix Production Bugs 20 Times Faster talk
Callaghan's cost-control story during Q&A about a 41% logging-bill reduction at BiggerPockets.
from_talk
Fix Production Bugs 20 Times Faster talk
Q&A insight about cost and vendor lock-in.
from_talk
Fix Production Bugs 20 Times Faster talk
Q&A observation about engineer cost visibility.
from_talk
Fix Production Bugs 20 Times Faster talk
Core principle of the SOS cycle articulated during Q&A.
from_talk
Fix Production Bugs 20 Times Faster talk
Q&A answer to the alerts-on-logs-vs-metrics question.
from_talk
Fix Production Bugs 20 Times Faster talk
Q&A answer to the PII privacy question.
from_talk
Fix Production Bugs 20 Times Faster talk
Callaghan's closing reflection on how observability work feeds curiosity (e.g. the 404/username-scraper discovery).
talk Fix Production Bugs 20 Times Faster
presented_at
Talk delivered at wroclove.rb 2025, with Callaghan as the first speaker.

Provenance