Troubleshooting integrations
A quick triage guide for the most common integration failures. Most issues fall into one of four buckets: credentials, network/egress, vendor outage, or webhook configuration.
Test connection fails immediately with 401 / 403
Credentials are wrong or have been rotated on the vendor side. Re-issue them from the vendor portal and paste the fresh values into Integrations → the row's edit menu.
Authentication failures (401 / 403 / 404) are not retried — they're treated as configuration signals, not transient noise. Repeated 401s usually mean credentials are invalid, expired, or scoped incorrectly.
Common causes:
- Autotask — the integration code is wrong, or the API user has been deactivated.
- Datto RMM / Datto EDR — API key has expired. Datto keys are particularly prone to silent expiry; check the Datto admin portal.
- IT Glue — the API key was revoked, or the wrong regional endpoint is set (EU vs. US — see Overview).
- Pax8 — OAuth client secret rotated.
Test connection times out
Usually a network-egress restriction on the vendor side (IP allow-list). Ops AI's outbound IPs are listed in the in-app help — ask Netmo support if you need them.
Vendor 5xx errors and timeouts are automatically retried with exponential backoff. A single log line for a terminal 5xx means the retries have already been exhausted — the vendor is genuinely unhealthy, not just flaking.
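The retry policy described above (auth errors are terminal configuration signals; 5xx and timeouts are retried with backoff) can be sketched as follows. This is a minimal illustration, not the actual implementation: the attempt count, base delay, and jitter strategy are assumptions, since those internals aren't documented here.

```python
import random

# Auth/config errors are treated as configuration signals -- never retried.
TERMINAL_STATUSES = {401, 403, 404}

def is_retryable(status: int, timed_out: bool = False) -> bool:
    """Vendor 5xx responses and timeouts are retried; 4xx config errors are not."""
    if timed_out:
        return True
    if status in TERMINAL_STATUSES:
        return False
    return 500 <= status <= 599

def backoff_delays(attempts: int = 5, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff with full jitter (illustrative parameters, not Ops AI's)."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

The practical consequence: a 503 is worth waiting out, but a repeated 401 will never fix itself and needs fresh credentials.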
Worked yesterday, broken today
Open Integration Status (/app/integrations/status). The last-check timestamp and most recent error message tell you whether it's:
- A transient outage — try again in a minute.
- An auth expiry (very common with Datto) — re-paste credentials.
- A vendor-side incident — check the vendor's status page.
Stuck in "checking…" for hours
The health probe runs every 5 minutes; a stuck check usually means the worker is wedged. Force a recheck from the integration row's menu, or contact support.
Webhook events failing
The Webhook Events page surfaces a Failure Analysis panel for every failed event. It shows:
- Pipeline stage — colour-coded badge indicating exactly where processing stopped (validation, sop_matching, agent_dispatch, agent_execution, tool_call).
- Error type and message — a short classifier (e.g. no_sop_matched, tool_execution_error) plus the underlying vendor error.
- Suggested fix — a remediation hint (e.g. "Create a SOP with keyword 'disk_full' for source 'autotask'").
- Metadata — stage-specific detail (vendor, tool name, agent run id, closest-matched SOP) in a collapsible JSON viewer.
Apply the suggested fix, then Replay the event to push it through the pipeline again.
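To make the four panel sections concrete, here is a hypothetical Failure Analysis payload as a Python dict. The field names are assumptions for illustration only; the real panel's schema may differ, but the content mirrors the sections described above.

```python
# Hypothetical shape of a Failure Analysis payload -- field names are
# illustrative, mirroring the four panel sections (stage, error, fix, metadata).
failure_analysis = {
    "pipeline_stage": "sop_matching",  # validation | sop_matching | agent_dispatch | agent_execution | tool_call
    "error_type": "no_sop_matched",
    "error_message": "No SOP matched the incoming event from source 'autotask'",
    "suggested_fix": "Create a SOP with keyword 'disk_full' for source 'autotask'",
    "metadata": {  # stage-specific detail, shown in the collapsible JSON viewer
        "vendor": "autotask",
        "closest_matched_sop": "Disk cleanup",
    },
}
```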
Replay guards — what you'll see when a replay is rejected
Replays are subject to four guards to prevent double-mutation:
| Error code | HTTP | Meaning | What to do |
|---|---|---|---|
| replay_in_flight | 409 | A replay of this source event is already received or processing. | Wait for the in-flight replay to finish; the response includes conflicting_replay_id. |
| replay_cooldown | 429 | You replayed the same source event in the last 5 minutes. | Retry after the cooldown — retry_after_seconds reflects the actual remaining cooldown. |
| replay_confirmation_required | 422 | The original event already succeeded. Replaying will re-run agents, re-create tickets, re-send notifications. | If you're sure, pass force=true (single replay) or set "force": true in the bulk replay body. |
| replay_rate_limit | 429 | You've exceeded the per-MSP replay rate limit (10 replays/minute). | Wait — the response includes retry_after_seconds. |
Reconcile with vendor
Some vendors (Autotask for webhooks, Microsoft Partner Center for GDAP) offer a Reconcile action on the relevant page. Reconcile is a read-only diff between local state and the vendor's view; it never deletes anything without explicit confirmation, and it never touches resources we didn't create.
Run reconcile periodically (weekly is a good cadence) to catch drift early — for example, a webhook that was deleted manually in Autotask, or a GDAP relationship that was terminated outside Ops AI.
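Conceptually, reconcile is just a read-only set diff between local records and the vendor's view. A sketch, assuming both sides are keyed by a vendor resource id (the function and field names are illustrative, not the actual API):

```python
def reconcile_diff(local_ids: set[str], vendor_ids: set[str]) -> dict:
    """Read-only diff between local state and the vendor's view; mutates nothing."""
    return {
        # present locally but gone at the vendor, e.g. a webhook deleted manually in Autotask
        "missing_on_vendor": sorted(local_ids - vendor_ids),
        # present at the vendor but unknown locally; never touched, since we didn't create it
        "unknown_locally": sorted(vendor_ids - local_ids),
    }
```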
Still stuck?
Contact support@msp-ops.ai with:
- The integration vendor + instance ID (visible on the Integrations page).
- A screenshot of the Integration Status detail or the Failure Analysis panel.
- The timestamp range you're interested in.
Support can look up the correlation ID for any failed call and pivot into vendor logs from there.