Fix Production Bugs 20 Times Faster

talk 27 connections

Callaghan's wroclove.rb 2025 talk, subtitled 'The Power of Structured Logging'. Opens with a Friday-afternoon on-call story at BiggerPockets in which password reset emails silently fail to send, exposing a blind spot in the logs, Sidekiq dashboards, and Sentry. Reports a survey of 50 engineers (half hit invisible issues monthly, a third weekly) and surfaces the emotional toll (annoyance, boredom, pressure, self-doubt).

Introduces the five-step 'Steps to Observable Software' (SOS) cycle: Question → Data → Instrumentation → Graphs → Improve. Walks the password-reset example: formulates hypotheses (failing, timing-out, or delayed jobs), refines the question to 'what jobs take time in the within-five-minutes queue', and defines the data needed as event × filter × group × value (expressible as SQL).

Argues that 'three pillars of observability' is a misnomer: traces, logs and metrics are data types, with traces the best and metrics the worst. Picks logs as a pragmatic gateway drug. Compares plain logs (unsearchable strings that need regex) with structured logs (attributes as columns that can be filtered, grouped, and sorted). Evaluates Ruby logging libraries against four criteria (structured payloads, Rails integration, docs, maturity) and recommends Semantic Logger, used in BiggerPockets production for three years.

Demonstrates installing rails_semantic_logger with JSON format and active_job autologging, then addresses three weaknesses: (1) no conventions → adopt OpenTelemetry Semantic Conventions (e.g. messaging.destination.name for queues, since OTel models background jobs as messaging alongside Kafka/RabbitMQ), swapping the semantic_logger subscriber and event formatter to emit OTel names; (2) missing attributes → use config.log_tags (lambda and method-shortcut syntaxes) in application.rb to tag every request with HTTP headers and user agent; (3) API requests missing → register Faraday middleware that logs outbound HTTP calls (URL, duration, etc.).

Shows shipping logs to Dynatrace via a semantic_logger HTTP appender that batches and POSTs. Revisits the incident with observability in place: a monitoring alert fires on Friday morning when the within-five-minutes queue latency breaches the 15-minute SLA; grouping jobs by class points to AnalyticsUpdateUserVisitsJob; grouping enqueues by HTTP resource points to ProfilesController#show; grouping requests by IP reveals a scraper; blocking the IP in infrastructure restores the graph in minutes. A second anecdote shows catching another scraper that was guessing usernames by spotting a 404 spike, blocking it, and reducing request time by 7.1%.

Closes with results (98% downtime reduction, 83% fewer 500s, 20× faster bug fixes) and a QR code to articles. Q&A covers cost control, domain-object logging, PII redaction, schemas, logs-vs-metrics for alerts, traces vs logs, and the events-all-the-way-down instrumentation architecture.
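
The event × filter × group × value framing from the summary can be sketched in plain Ruby over JSON log lines. This is a minimal illustration, not the talk's code: the log lines, the field names (event, queue, job_class, duration_ms), the queue name within_five_minutes, and PasswordResetMailerJob are all hypothetical. Only AnalyticsUpdateUserVisitsJob is named in the talk.

```ruby
require "json"

# Hypothetical structured log lines, one JSON object per event.
LOG_LINES = [
  '{"event":"job.perform","queue":"within_five_minutes","job_class":"AnalyticsUpdateUserVisitsJob","duration_ms":900}',
  '{"event":"job.perform","queue":"within_five_minutes","job_class":"AnalyticsUpdateUserVisitsJob","duration_ms":1100}',
  '{"event":"job.perform","queue":"within_five_minutes","job_class":"PasswordResetMailerJob","duration_ms":120}',
  '{"event":"job.perform","queue":"default","job_class":"AnalyticsUpdateUserVisitsJob","duration_ms":500}',
  '{"event":"job.enqueue","queue":"within_five_minutes","job_class":"PasswordResetMailerJob","duration_ms":5}'
].freeze

# Roughly equivalent SQL, assuming a "logs" table with the same columns:
#   SELECT job_class, SUM(duration_ms) FROM logs
#   WHERE event = 'job.perform' AND queue = 'within_five_minutes'
#   GROUP BY job_class
def sum_by_group(lines, event:, filter:, group:, value:)
  lines
    .map { |line| JSON.parse(line) }                                        # structured rows
    .select { |row| row["event"] == event }                                 # event
    .select { |row| filter.all? { |key, expected| row[key] == expected } }  # filter
    .group_by { |row| row[group] }                                          # group
    .transform_values { |rows| rows.sum { |row| row[value] } }              # value
end

# "What jobs take time in the within-five-minutes queue?"
SLOW_JOBS = sum_by_group(
  LOG_LINES,
  event: "job.perform",
  filter: { "queue" => "within_five_minutes" },
  group: "job_class",
  value: "duration_ms"
)
```

In practice this query would run in the log backend rather than in application code; the sketch only shows that each of the four parts maps to one step.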

type

talk

subtitle

The Power of Structured Logging

talk Fix Production Bugs 20 Times Faster

about

Structured Logging concept

Talk is subtitled 'The Power of Structured Logging' and makes the case for structured over plain logs.
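
A rough illustration of the contrast, assuming a simplified and hypothetical log line for a job run (neither line is taken from the talk): the plain line needs a regex that breaks whenever the wording changes, while the structured line parses straight into named attributes.

```ruby
require "json"

# Hypothetical examples of the same event in both styles.
PLAIN_LINE = "Performed PasswordResetMailerJob from Sidekiq(within_five_minutes) in 120.5ms"
STRUCTURED_LINE = '{"message":"Performed job","job_class":"PasswordResetMailerJob",' \
                  '"queue":"within_five_minutes","duration_ms":120.5}'

# The pattern is tied to the exact wording of the message.
PLAIN_PATTERN = /\APerformed (?<job_class>\w+) from Sidekiq\((?<queue>\w+)\) in (?<duration_ms>[\d.]+)ms\z/

# Returns a hash of attributes, or nil when the wording does not match.
def parse_plain(line)
  match = PLAIN_PATTERN.match(line)
  return nil unless match

  {
    "job_class" => match[:job_class],
    "queue" => match[:queue],
    "duration_ms" => match[:duration_ms].to_f
  }
end

# Structured logs need no pattern: every attribute is already a named field.
def parse_structured(line)
  JSON.parse(line)
end
```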

talk Fix Production Bugs 20 Times Faster

about

Steps to Observable Software concept

Introduces and walks through the SOS five-step cycle.

talk Fix Production Bugs 20 Times Faster

about

Semantic Logger tool

Recommends and demonstrates semantic_logger as the structured-logging library of choice.

talk Fix Production Bugs 20 Times Faster

about

OpenTelemetry Semantic Conventions concept

Adopts OTel semantic conventions as the attribute names used in structured logs.
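
As a sketch of what adopting the conventions means for a log payload, the renaming step could look like the following. The talk names messaging.destination.name for queues; the input keys (queue, job_id, job_class) and the helper to_otel_names are hypothetical, and this is not the semantic_logger subscriber or formatter shown in the talk.

```ruby
# Mapping from hypothetical app-specific keys to OTel attribute names.
# messaging.destination.name is the name cited in the talk;
# messaging.message.id is another attribute from the OTel messaging conventions.
OTEL_ATTRIBUTE_NAMES = {
  "queue" => "messaging.destination.name",
  "job_id" => "messaging.message.id"
}.freeze

# Renames known keys and leaves everything else untouched (as string keys).
def to_otel_names(payload)
  payload.transform_keys { |key| OTEL_ATTRIBUTE_NAMES.fetch(key.to_s, key.to_s) }
end
```

With shared names, the same filter (for example on messaging.destination.name) works across every job library and service that follows the convention.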

talk Fix Production Bugs 20 Times Faster

about

OpenTelemetry tool

Discusses OTel as a standards body and the maturity status of its Ruby library.

talk Fix Production Bugs 20 Times Faster

about

Sidekiq tool

Password-reset background jobs run on Sidekiq; the dashboard was insufficient for diagnosing queue delays.

talk Fix Production Bugs 20 Times Faster

about

Faraday tool

Outbound API logging middleware is demonstrated on Faraday.
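
The shape of such a middleware can be sketched without the Faraday gem, since Faraday middleware follows the Rack-like pattern of wrapping an app and implementing call(env). The class below is an assumption-laden stand-in, not the talk's code: OutboundRequestLogger, FakeEnv, FakeResponse, the event name, and the duration_ms key are hypothetical, and the standard-library Logger stands in for Semantic Logger.

```ruby
require "json"
require "logger"

# Rack-style middleware: wraps the next app in the stack and logs one
# structured line per outbound request.
class OutboundRequestLogger
  def initialize(app, logger:)
    @app = app
    @logger = logger
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

    @logger.info(JSON.generate(
      "event" => "http.client.request",            # hypothetical event name
      "http.request.method" => env.method.to_s.upcase,
      "url.full" => env.url.to_s,
      "http.response.status_code" => response.status,
      "duration_ms" => (elapsed * 1000).round(1)   # hypothetical key
    ))
    response
  end
end

# Stand-ins for Faraday's request env and response, for illustration only.
FakeEnv = Struct.new(:method, :url)
FakeResponse = Struct.new(:status)
```

In a real application the class would typically subclass Faraday::Middleware and be added to the connection's middleware stack; that wiring is omitted here.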

talk Fix Production Bugs 20 Times Faster

about

Dynatrace company

Demonstrates shipping structured logs to Dynatrace via an HTTP appender.

talk Fix Production Bugs 20 Times Faster

about

Ruby on Rails tool

Entire talk is scoped to a Rails production application.

talk Fix Production Bugs 20 Times Faster

about

Logs vs Traces vs Metrics concept

Argues against the 'three pillars' framing and ranks the three data types.

talk Fix Production Bugs 20 Times Faster

about

Rails 8.1 tool

Mentions upcoming structured-logging improvements in Rails 8.1.

talk Fix Production Bugs 20 Times Faster

about

Sentry tool

Team used Sentry to confirm that jobs weren't failing.

question Cost control for structured logging

asked_at