Troubleshooting integrations

A quick triage guide for the most common integration failures. Most issues fall into one of four buckets: credentials, network/egress, vendor outage, or webhook configuration.

Test connection fails immediately with 401 / 403

Credentials are wrong or have been rotated on the vendor side. Re-issue them from the vendor portal and paste the new values into Integrations → the row's edit menu.

Authentication and not-found errors (401 / 403 / 404) are not retried — they're treated as configuration signals, not transient noise. Repeated 401s usually mean the credentials are invalid, expired, or scoped incorrectly.

Common causes:

  • Autotask — the integration code is wrong, or the API user has been deactivated.
  • Datto RMM / Datto EDR — API key has expired. Datto keys are particularly prone to silent expiry; check the Datto admin portal.
  • IT Glue — the API key was revoked, or the wrong regional endpoint is set (EU vs. US — see Overview).
  • Pax8 — OAuth client secret rotated.

Test connection times out

Usually a network-egress restriction on the vendor side (IP allow-list). Ops AI's outbound IPs are listed in the in-app help — ask Netmo support if you need them.

Vendor 5xx errors and timeouts are automatically retried with exponential backoff. A single log line for a terminal 5xx means the retry loop has already been exhausted — the vendor is genuinely unhealthy, not flaking.
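
The two behaviours above (fail fast on 401/403/404, back off on 5xx responses and timeouts) amount to a retry policy like the sketch below. The attempt cap, base delay, and function name are illustrative assumptions, not the actual Ops AI implementation.

```python
import random
import time

import requests

NO_RETRY_STATUSES = {401, 403, 404}   # configuration signals: fail fast, never retry
MAX_ATTEMPTS = 5                      # assumed cap, not the actual Ops AI value
BASE_DELAY = 1.0                      # seconds before the first retry

def call_vendor(url: str, headers: dict) -> requests.Response:
    """Call a vendor endpoint, retrying 5xx responses and timeouts with backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
        except requests.Timeout:
            resp = None  # treat a timeout like any other transient failure
        if resp is not None:
            if resp.status_code in NO_RETRY_STATUSES:
                resp.raise_for_status()  # configuration error: surface it immediately
            if resp.status_code < 500:
                return resp              # success (or another non-retryable status)
        if attempt == MAX_ATTEMPTS:
            raise RuntimeError(f"vendor still unhealthy after {MAX_ATTEMPTS} attempts")
        # exponential backoff with a little jitter: ~1s, 2s, 4s, 8s ...
        time.sleep(BASE_DELAY * 2 ** (attempt - 1) + random.uniform(0, 0.5))
```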

Worked yesterday, broken today

Open Integration Status (/app/integrations/status). The last-check timestamp and most recent error message tell you whether it's:

  • A transient outage — try again in a minute.
  • An auth expiry (very common with Datto) — re-paste credentials.
  • A vendor-side incident — check the vendor's status page.

Stuck in "checking…" for hours

The health probe runs every 5 minutes; a stuck check usually means the worker is wedged. Force a recheck from the integration row's menu, or contact support.

Webhook events failing

The Webhook Events page surfaces a Failure Analysis panel for every failed event. It shows:

  • Pipeline stage — colour-coded badge indicating exactly where processing stopped (validation, sop_matching, agent_dispatch, agent_execution, tool_call).
  • Error type and message — a short classifier (e.g. no_sop_matched, tool_execution_error) plus the underlying vendor error.
  • Suggested fix — a remediation hint (e.g. "Create a SOP with keyword 'disk_full' for source 'autotask'").
  • Metadata — stage-specific detail (vendor, tool name, agent run id, closest-matched SOP) in a collapsible JSON viewer.
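
Taken together, a failed event's analysis is roughly the record sketched below. The field names are assumptions inferred from the panel, not a documented schema; the classifier and suggested-fix wording are taken from the examples above.

```python
# Hypothetical shape of a Failure Analysis record, mirroring the panel fields above.
# Field names and metadata values are illustrative assumptions, not a documented schema.
failure_analysis = {
    "pipeline_stage": "sop_matching",      # where processing stopped
    "error_type": "no_sop_matched",        # short classifier
    "error_message": "no SOP matched keywords extracted from this event",
    "suggested_fix": "Create a SOP with keyword 'disk_full' for source 'autotask'",
    "metadata": {                          # stage-specific detail (collapsible JSON in the UI)
        "vendor": "autotask",
        "closest_matched_sop": "Disk cleanup / workstation",
    },
}
```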

Apply the suggested fix, then Replay the event to push it through the pipeline again.

Replay guards — what you'll see when a replay is rejected

Replays are subject to four guards to prevent double-mutation:

  • replay_in_flight (HTTP 409) — a replay of this source event is already received or processing. What to do: wait for the in-flight replay to finish; the response includes conflicting_replay_id.
  • replay_cooldown (HTTP 429) — you replayed the same source event in the last 5 minutes. What to do: retry after the cooldown; retry_after_seconds reflects the actual remaining cooldown.
  • replay_confirmation_required (HTTP 422) — the original event already succeeded, and replaying will re-run agents, re-create tickets, and re-send notifications. What to do: if you're sure, pass force=true (single replay) or set "force": true in the bulk replay body.
  • replay_rate_limit (HTTP 429) — you've exceeded the per-MSP replay rate limit (10 replays/minute). What to do: wait; the response includes retry_after_seconds.
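
If you replay events programmatically, a client needs to handle each guard explicitly. The sketch below assumes a hypothetical replay endpoint and an error_code field in the response body; the guard codes, force flag, retry_after_seconds, and conflicting_replay_id come from the guards described above.

```python
import time

import requests

# Hypothetical endpoint path and host; the real API route may differ.
REPLAY_URL = "https://app.msp-ops.ai/api/webhook-events/{event_id}/replay"

def replay_event(event_id: str, token: str, force: bool = False) -> dict:
    """Replay a failed webhook event, honouring the four replay guards."""
    resp = requests.post(
        REPLAY_URL.format(event_id=event_id),
        headers={"Authorization": f"Bearer {token}"},
        json={"force": force},
        timeout=10,
    )
    if resp.ok:
        return resp.json()

    body = resp.json()
    code = body.get("error_code")  # assumed field name for the guard code

    if resp.status_code == 409 and code == "replay_in_flight":
        # Another replay is already running; wait for it rather than piling on.
        raise RuntimeError(f"replay {body['conflicting_replay_id']} already in flight")

    if resp.status_code == 429:
        # replay_cooldown or replay_rate_limit: back off for the advertised window.
        time.sleep(body["retry_after_seconds"])
        return replay_event(event_id, token, force=force)

    if resp.status_code == 422 and code == "replay_confirmation_required":
        # The original event already succeeded; re-running it must be deliberate.
        raise RuntimeError("original event succeeded; call again with force=True")

    resp.raise_for_status()  # anything else is an ordinary API error
```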

Reconcile with vendor

Some vendors (Autotask for webhooks, Microsoft Partner Center for GDAP) offer a Reconcile action on the relevant page. Reconcile is a read-only diff between local state and the vendor's view; it never deletes anything without explicit confirmation, and it never touches resources we didn't create.

Run reconcile periodically (weekly is a good cadence) to catch drift early — for example, a webhook that was deleted manually in Autotask, or a GDAP relationship that was terminated outside Ops AI.
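
Conceptually, reconcile is just a read-only set comparison between the records Ops AI holds and what the vendor reports. A rough sketch of the idea (not the actual implementation):

```python
def reconcile(local_ids: set[str], vendor_ids: set[str]) -> dict[str, list[str]]:
    """Read-only diff between Ops AI's records and the vendor's view.

    Reports drift (e.g. a webhook deleted manually in Autotask) without
    mutating anything; cleanup only happens after explicit confirmation.
    """
    return {
        "missing_on_vendor": sorted(local_ids - vendor_ids),   # deleted outside Ops AI
        "unknown_locally": sorted(vendor_ids - local_ids),     # created outside Ops AI
        "in_sync": sorted(local_ids & vendor_ids),
    }

# Example: one webhook we created no longer exists on the vendor side.
print(reconcile({"wh-1", "wh-2"}, {"wh-2", "wh-3"}))
# {'missing_on_vendor': ['wh-1'], 'unknown_locally': ['wh-3'], 'in_sync': ['wh-2']}
```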

Still stuck?

Contact support@msp-ops.ai with:

  • The integration vendor + instance ID (visible on the Integrations page).
  • A screenshot of the Integration Status detail or the Failure Analysis panel.
  • The timestamp range you're interested in.

Support can look up the correlation ID for any failed call and pivot into vendor logs from there.