Show & Tell
we have 15 workflows running in production and monitoring them used to be chaotic. here's what i set up:
1. heartbeat table: every workflow writes a row after each successful execution with the workflow name, timestamp, and execution duration. a scheduled "watchdog" workflow runs every 30 min and checks whether any workflow has failed to write a heartbeat within its expected interval. if something is late, it alerts on slack
2. error aggregation: all workflow error branches route to a single "error handler" sub-workflow that logs to an error table and sends a slack notification with the workflow name, error message, and a link to the execution log
3. daily summary: scheduled workflow at 6pm that queries the heartbeat table and error table, calculates success rate per workflow, and emails me a one-page summary
4. the error table also has a "resolved" boolean column. during our monday standup we review unresolved errors and mark them as resolved after fixing
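for anyone who wants the watchdog check from item 1 spelled out, here's a minimal python sketch of the lateness logic. the workflow names, the intervals, and the `find_late_workflows` helper are all made up for illustration; the actual check lives inside a scheduled workflow and may look quite different.

```python
from datetime import datetime, timedelta

# hypothetical expected intervals per workflow (illustrative values only)
EXPECTED_INTERVAL = {
    "sync_orders": timedelta(hours=1),
    "nightly_report": timedelta(days=1),
}


def find_late_workflows(last_heartbeat, expected_interval, now):
    """Return the sorted names of workflows whose latest heartbeat is overdue.

    last_heartbeat: dict of workflow name -> datetime of its newest heartbeat row
    expected_interval: dict of workflow name -> timedelta it should run within
    now: the current time, passed in so the check is easy to test
    """
    late = []
    for name, interval in expected_interval.items():
        last = last_heartbeat.get(name)
        # a workflow that has never written a heartbeat counts as late
        if last is None or now - last > interval:
            late.append(name)
    return sorted(late)
```

the watchdog would call this every 30 min with the newest heartbeat per workflow and send one slack alert per name returned. keeping the expected intervals in one place means adding a workflow is a one-line change.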
this setup took about a day to build and has prevented multiple production issues from going unnoticed. highly recommend some version of this if you have more than 5 workflows
the heartbeat table approach is really clever. we have something similar but less organized. going to steal this exact setup. the watchdog workflow checking for missed heartbeats is the part we were missing
this is basically what we do at the agency for our client workflows. a few additions worth considering:
- add a "last successful run" field to the heartbeat table so you can see at a glance when each workflow last ran successfully
- color code the monday standup report (green = all good, yellow = had errors but recovered, red = currently broken)
- set up different alert levels. a 5-min delay on a daily workflow is fine. a 5-min delay on a real-time webhook processor is not
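the alert-levels idea can be sketched as a per-workflow tolerance lookup. this is only an illustration in python: the workflow names, the tolerance values, and the `should_alert` helper are assumptions, not something from an actual setup.

```python
from datetime import timedelta

# hypothetical tolerances: how late a workflow may be before anyone is alerted
TOLERANCE = {
    "daily_report": timedelta(minutes=30),
    "webhook_processor": timedelta(minutes=1),
}
# fallback for workflows with no explicit entry (also an assumed value)
DEFAULT_TOLERANCE = timedelta(minutes=10)


def should_alert(workflow, delay):
    """Return True when the delay exceeds the tolerance for this workflow."""
    return delay > TOLERANCE.get(workflow, DEFAULT_TOLERANCE)
```

with these values, a 5-min delay on `daily_report` stays quiet while the same delay on `webhook_processor` triggers an alert.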
the daily summary email is such a simple idea but so valuable. knowing your success rate across all workflows at a glance is something every ops team needs
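a rough python sketch of the per-workflow success rate calculation, for reference. it assumes each successful run leaves exactly one heartbeat row and each failed run leaves exactly one error row, which the thread doesn't confirm; `success_rates` is a made-up helper name.

```python
from collections import Counter


def success_rates(heartbeats, errors):
    """Return workflow name -> fraction of runs that succeeded.

    heartbeats: list of workflow names, one entry per successful run
    errors: list of workflow names, one entry per failed run (assumed 1:1)
    """
    ok = Counter(heartbeats)
    bad = Counter(errors)
    rates = {}
    # include workflows that only appear in one of the two tables
    for name in sorted(set(ok) | set(bad)):
        total = ok[name] + bad[name]
        rates[name] = ok[name] / total
    return rates
```

if a single failure can write several error rows (retries, for example), the error list would need deduplicating by execution before counting.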
this should honestly be a template that ships with the platform. the monitoring setup is something everyone with 5+ workflows needs