Overview
When a workflow fails, it won't always be for a reason you anticipated. Network timeouts, upstream API changes, and unexpected data can all stop a workflow mid-run without triggering any of the error paths you've explicitly built. This article covers how to monitor for workflow failures, both the ones you planned for and the ones you didn't, so that nothing slips through unnoticed.
Understanding the two types of workflow failure
Not all failures are the same, and the right approach depends on which type you're dealing with.
- Anticipated failures are errors you've identified in advance — for example, a specific API returning a known error code when a record doesn't exist, or a condition that catches missing required data. These are handled inline using retry strategies and error links directly on the nodes that might fail.
- Unanticipated failures are anything the workflow wasn't designed to handle — network errors, timeouts, malformed payloads, or upstream changes. These cause the entire workflow execution to reach a Failed status, and no in-graph error path will catch them unless you've explicitly set up a catch-all.
This article focuses on unanticipated failures. For node-level error handling, see Handling Node Failures: Retry Strategies, Timeouts, and Error Links.
Monitoring executions for failures
The fastest way to see if a workflow is failing is to open its execution history and filter by status.
- Navigate to Workflows and open the workflow you want to monitor.
- Select the Execution History tab.
- Use the Status filter and select Failed.
This view shows all failed executions in reverse chronological order. Clicking into any execution lets you inspect which node failed, what error code was returned, and what data was in the workflow state at that point.
For workflows running in production, consider bookmarking the execution history view pre-filtered to Failed as a lightweight operational dashboard for your team.
Setting up automatic failure alerts
If you want to be notified when a workflow fails without checking manually, gaiia supports a failure alerting workflow that runs automatically whenever any native workflow execution fails. It captures which nodes failed, the error codes returned, and who triggered the execution — and sends the alert via Slack, Microsoft Teams, or email.
This alerting workflow is added to your environment by the gaiia team. To request it, contact support@gaiia.com. Once added, you can customize routing using Condition nodes — for example, sending API errors to an engineering Slack channel and billing errors to a finance inbox.
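The kind of routing that Condition nodes express can be sketched in plain JavaScript. This is only an illustration of the decision logic: the error-code prefixes (API_, BILLING_), the destination strings, and the routeAlert function are hypothetical and are not gaiia identifiers. In practice you would configure the equivalent conditions in the workflow builder.

```javascript
// Hypothetical routing table: the prefixes and destinations are assumptions
// made for illustration, not real gaiia error codes or channel names.
const ROUTES = [
  { prefix: 'API_', destination: 'slack:#engineering-alerts' },
  { prefix: 'BILLING_', destination: 'email:finance@example.com' },
];

// Fallback destination for error codes that match no rule.
const DEFAULT_DESTINATION = 'slack:#workflow-alerts';

// Return the destination for an error code, checking the rules in order.
function routeAlert(errorCode) {
  const code = String(errorCode ?? '');
  const match = ROUTES.find((route) => code.startsWith(route.prefix));
  return match ? match.destination : DEFAULT_DESTINATION;
}

console.log(routeAlert('API_TIMEOUT'));
console.log(routeAlert('BILLING_CARD_DECLINED'));
console.log(routeAlert('SOMETHING_ELSE'));
```

Keeping a default destination means an error code nobody planned for still reaches someone, which matches the goal of catching unanticipated failures.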
For full setup instructions, see Set up failure alerts for workflows.
Reacting to any failure with a trigger-on-failure workflow
For more control over what happens when a workflow fails, you can build a dedicated "failure watcher" workflow triggered by the workflowExecutionFailed event. This approach lets you define exactly what happens in response — creating a ticket, posting to a channel, logging an activity comment — using the same workflow builder tools you already know.
Unlike error links on individual nodes, a trigger-on-failure workflow catches any failure in the target workflow, including errors that no in-graph path handles.
- Create a new workflow.
- Set the trigger type to Event.
- Select workflowExecutionFailed as the event.
- Use the skip execution mapper on the trigger to filter to the specific workflow you want to watch. For example:
  return state.input?.workflowId !== 'your-workflow-id';
  Without this filter, the watcher will fire for failures across all workflows in your environment.
- Add nodes to respond to the failure, for example a Create Activity Log Comment node to log the failure on the affected record, or an HTTP node to post to a Slack channel.
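If one watcher should cover several workflows, the same skip check can be extended to a list of IDs. The sketch below wraps the mapper body in a standalone function so it can be run outside the workflow builder. It assumes, as the example in the steps does, that returning true skips the execution and that the failed workflow's ID is available at state.input.workflowId. The IDs and the names skipExecution and WATCHED_WORKFLOW_IDS are placeholders.

```javascript
// Placeholder IDs: replace with the workflows this watcher should cover.
const WATCHED_WORKFLOW_IDS = new Set(['your-workflow-id', 'another-workflow-id']);

// Returns true when the execution should be skipped, i.e. when the failed
// workflow is not on the watch list (or its ID is missing from the input).
function skipExecution(state) {
  const workflowId = state.input?.workflowId;
  return !WATCHED_WORKFLOW_IDS.has(workflowId);
}

console.log(skipExecution({ input: { workflowId: 'your-workflow-id' } })); // false
console.log(skipExecution({ input: { workflowId: 'some-other-id' } })); // true
console.log(skipExecution({})); // true
```

Inside the mapper itself, only the body of skipExecution would be needed, ending in the return statement.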
The workflowExecutionFailed event payload includes the execution ID, the workflow ID, and the object ID the execution ran against. You can use these to query for additional context in subsequent nodes.
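As a rough sketch of how those payload values might feed an HTTP node that posts to Slack, the function below builds a JSON body in the shape Slack incoming webhooks accept (an object with a text field). The property names executionId, workflowId, and objectId are assumed for illustration, so check the actual event payload in your environment. The function name buildSlackBody is hypothetical.

```javascript
// Build the JSON body for a Slack incoming webhook from the event payload.
// The field names (executionId, workflowId, objectId) are assumptions.
function buildSlackBody(payload) {
  const { executionId, workflowId, objectId } = payload ?? {};
  const lines = [
    'Workflow execution failed',
    `Workflow: ${workflowId ?? 'unknown'}`,
    `Execution: ${executionId ?? 'unknown'}`,
    `Object: ${objectId ?? 'unknown'}`,
  ];
  // Slack incoming webhooks accept a JSON object with a "text" property.
  return JSON.stringify({ text: lines.join('\n') });
}

// Example with made-up IDs.
console.log(buildSlackBody({ executionId: 'exec-1', workflowId: 'wf-1', objectId: 'obj-1' }));
```

Falling back to 'unknown' keeps the alert readable even if a field is absent from a particular failure.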
Adding a catch-all error link to critical nodes
For individual nodes where any failure should trigger a specific response, you can draw an error link using the * wildcard error code. This catches any error the node returns — not just the ones you've named — and routes execution to a recovery path in the graph.
This is useful when you know a node is critical and want a fallback action (like creating a ticket or sending an alert) to run automatically if it fails for any reason.
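To make the idea concrete, here is a small model of how a wildcard link could sit alongside named error links. It is a conceptual sketch, not gaiia's implementation: the error code, the target node names, and the resolveErrorLink function are hypothetical, and the rule that a named code takes precedence over * is an assumption.

```javascript
// Hypothetical error links for one node: error code -> recovery path.
const errorLinks = new Map([
  ['RECORD_NOT_FOUND', 'create-record'],
  ['*', 'open-support-ticket'],
]);

// Pick the recovery path for an error code: an exact match first (assumed
// precedence), then the '*' catch-all, otherwise null (no link applies).
function resolveErrorLink(links, errorCode) {
  if (links.has(errorCode)) return links.get(errorCode);
  if (links.has('*')) return links.get('*');
  return null;
}

console.log(resolveErrorLink(errorLinks, 'RECORD_NOT_FOUND')); // create-record
console.log(resolveErrorLink(errorLinks, 'NETWORK_TIMEOUT')); // open-support-ticket
console.log(resolveErrorLink(new Map(), 'NETWORK_TIMEOUT')); // null
```

The last call shows the case this article is concerned with: without a catch-all, an unexpected error has no path to follow.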
For step-by-step instructions, see Handling Node Failures: Retry Strategies, Timeouts, and Error Links.
Choosing the right approach
These patterns work well together. Here's a quick reference for when to use each one:
- Execution history filter: best for periodic spot-checks and investigating specific incidents. No setup required.
- Failure alert workflow: best for passive, team-wide visibility. Notify the right people without anyone having to check manually.
- Trigger-on-failure workflow: best when you need to take action in response to a failure — logging, ticketing, routing. Catches everything regardless of error type.
- Catch-all error link (*): best for critical individual nodes where any failure should branch to a recovery path inline within the workflow.