Incident response

This page is for IT admins at customer sites. It defines:

  • How to recognize a Claresia incident
  • How to declare one
  • What Claresia commits to (per Mode A/B/C SLA)
  • The post-incident review (PIR) cadence

For Claresia internal IR runbooks (engineering on-call), see the internal runbooks repo (not in this docs site).

Three signals indicate a Claresia incident:

  1. status.claresia.com is yellow or red on a service relevant to you
  2. The Command Center health check for your tenant fails (see the Step 10 verification list in the Onboarding runbook)
  3. End users report skill failures, missing skills in their LLM picker, or stale data

If you see any of these, don’t troubleshoot — declare to Claresia and we’ll investigate in parallel.

Severity | Mode A | Mode B | Mode C
Sev 1 (down) | Slack/Teams Connect channel + email support@claresia.com | Same + phone hotline | Same + phone hotline
Sev 2 (degraded) | Slack/Teams Connect | Slack/Teams Connect | Slack/Teams Connect
Sev 3 (cosmetic) | Email or Hub Browser feedback button | Same | Same

  • Sev 1 (Down) — Tenant cannot complete a primary workflow:
    • SSO sign-in fails for all users
    • Hub writes failing
    • All LLM connectors offline
    • Command Center inaccessible
  • Sev 2 (Degraded) — Significant function impaired:
    • One LLM connector down (others still working)
    • Telemetry pipeline > 30 min behind
    • Hub Browser slow but writes succeed
    • Onboarding Portal’s smoke-test step fails (existing tenants unaffected)
  • Sev 3 (Cosmetic) — Minor function impacted:
    • UI rendering issue (e.g., Adaptive Card columns misaligned)
    • Specific skill returns inconsistent format
    • Dashboard slow in one tab
  • Sev 4 (Informational) — User question or feature request

When you declare, include:

  • Tenant slug
  • Number of impacted users (estimate is fine)
  • First-noticed time + your local timezone
  • Symptoms (what users see)
  • Any recent changes on your side (IdP migration, AAD change, LLM platform update, network ACL change)
  • Any error codes from Command Center / Hub Browser
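
The fields above can be collected into one message before posting to your incident channel. The following is a minimal sketch, not a Claresia API: the `IncidentDeclaration` class, its field names, and the output layout are all hypothetical and only mirror the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentDeclaration:
    """Hypothetical container for the declaration fields; not a Claresia type."""

    tenant_slug: str
    impacted_users: int        # an estimate is fine
    first_noticed: datetime    # must be timezone-aware (your local timezone)
    symptoms: str              # what users see
    recent_changes: list[str] = field(default_factory=list)
    error_codes: list[str] = field(default_factory=list)

    def to_message(self) -> str:
        """Render the declaration as plain text for a channel or email."""
        if self.first_noticed.tzinfo is None:
            raise ValueError("first_noticed must carry a timezone")
        noticed = f"{self.first_noticed.isoformat()} ({self.first_noticed.tzname()})"
        return "\n".join([
            f"Tenant slug: {self.tenant_slug}",
            f"Impacted users (est.): {self.impacted_users}",
            f"First noticed: {noticed}",
            f"Symptoms: {self.symptoms}",
            "Recent changes: " + ("; ".join(self.recent_changes) or "none known"),
            "Error codes: " + (", ".join(self.error_codes) or "none seen"),
        ])
```

Rejecting a naive timestamp is a deliberate choice in this sketch: the first-noticed time is only useful for building a timeline if its timezone is unambiguous.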

Severity | Mode A | Mode B/C
Sev 1, initial response | 30 min | 15 min
Sev 1, full restoration | best-effort | 4h target
Sev 2, initial response | 4h | 1h
Sev 2, full restoration | best-effort | 24h target
Sev 3, initial response | 1 business day | 1 business day
Sev 4, initial response | 3 business days | 1 business day

Initial response = a human Claresia engineer acknowledges in your channel. Full restoration = service restored to nominal SLO.
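
To make the targets concrete, the sketch below turns a declaration time into expected acknowledgement and restoration times. It assumes the clock starts when you declare, which this page does not state, and it models only the hour-based targets for Sev 1 and Sev 2; the business-day targets depend on a working calendar that is not defined here. The function and table names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

# Initial-response targets for Sev 1 and Sev 2, transcribed from the SLA table.
INITIAL_RESPONSE = {
    ("sev1", "A"): timedelta(minutes=30),
    ("sev1", "B"): timedelta(minutes=15),
    ("sev1", "C"): timedelta(minutes=15),
    ("sev2", "A"): timedelta(hours=4),
    ("sev2", "B"): timedelta(hours=1),
    ("sev2", "C"): timedelta(hours=1),
}

# Full-restoration targets. Mode A is best-effort, so it has no entry.
FULL_RESTORATION = {
    ("sev1", "B"): timedelta(hours=4),
    ("sev1", "C"): timedelta(hours=4),
    ("sev2", "B"): timedelta(hours=24),
    ("sev2", "C"): timedelta(hours=24),
}


def expected_by(
    declared_at: datetime, severity: str, mode: str
) -> tuple[datetime, Optional[datetime]]:
    """Return (response_due, restoration_target); the target is None if best-effort.

    Assumption: both intervals are measured from the declaration time.
    """
    key = (severity, mode)
    if key not in INITIAL_RESPONSE:
        raise KeyError(f"no hour-based target modelled for {severity!r} in Mode {mode!r}")
    response_due = declared_at + INITIAL_RESPONSE[key]
    restoration = FULL_RESTORATION.get(key)
    return response_due, (declared_at + restoration if restoration else None)
```

For example, a Mode B Sev 1 declared at 09:00 would expect acknowledgement by 09:15 and carries a restoration target of 13:00, while the same incident in Mode A would expect acknowledgement by 09:30 with no restoration target.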

During an active incident:

  • Status updates posted to status.claresia.com every 30 minutes (or when meaningful change occurs)
  • Slack/Teams Connect channel gets the same update, threaded
  • Email broadcast on Sev 1 to your incident-contact list (set in Command Center → Settings → Contacts)

We may ask you for:

  • Confirmation that you can reproduce the issue
  • Your tenant slug and examples of impacted users
  • A browser dev-tools network log (HAR file) for at least one failed request
  • Recent governance_event records (last 1 hour), pulled from your Hub via the Hub API and shared in your incident channel

We will not ask you for:

  • Your IdP credentials
  • LLM platform API keys
  • End-user passwords
  • Hub record content beyond what’s needed to debug
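
A HAR file captures request and response headers verbatim, so it can carry session cookies or bearer tokens even though we never ask for credentials. One way to reduce that risk before sharing is to redact the obvious fields. This is a minimal sketch that relies only on the standard HAR layout (`log.entries[].request/response.headers` and `.cookies`); the `redact_har` helper and the header list are illustrative, not a Claresia tool.

```python
import copy

# Header names (compared case-insensitively) whose values are replaced.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie", "set-cookie"}
REDACTED = "[REDACTED]"


def redact_har(har: dict) -> dict:
    """Return a copy of a parsed HAR document with credentials masked.

    Only header values and cookie values are touched. Request bodies and
    query strings are left as-is and should be reviewed by hand.
    """
    cleaned = copy.deepcopy(har)  # never mutate the caller's data
    for entry in cleaned.get("log", {}).get("entries", []):
        for side in ("request", "response"):
            message = entry.get(side, {})
            for header in message.get("headers", []):
                if header.get("name", "").lower() in SENSITIVE_HEADERS:
                    header["value"] = REDACTED
            for cookie in message.get("cookies", []):
                cookie["value"] = REDACTED
    return cleaned
```

Load the exported file with `json.load`, pass the result through `redact_har`, and write it back with `json.dump` before attaching it to the incident channel. URLs, status codes, and timings are preserved, which is usually what a failed-request investigation needs.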

For every Sev 1 and most Sev 2:

  • Within 5 business days of resolution, Claresia delivers a PIR report
  • Sections: Summary, Timeline, Root Cause, Impact, Remediation, Lessons
  • The PIR is shared in your incident channel + as a downloadable PDF
  • Your IT admin + CSM can request a 30-min walk-through

PIRs are also written to your Hub as output records of skill gatespic.incident-postmortem — searchable in Hub Browser.

Customer-side responsibilities during an incident

  • Don’t rotate credentials in the middle of an incident (it makes diagnosis harder); wait for resolution
  • Don’t unilaterally republish skills (Distribution Plane is in a known state — additional pushes confuse forensics)
  • Do keep the channel open with timestamps + screenshots
  • Do provide your impacted-user count updates as the situation evolves

Recurring incident patterns + standing fixes

Pattern | Standing fix
LLM platform audit log delay > 5 min | Wait: usually self-resolves; we don’t auto-page on this
AAD admin consent prompt during onboarding | Pre-consent in Azure → Enterprise Applications → User Settings
SCIM bearer rotated by IdP without notice | Subscribe to Claresia “credential health” digest
End user reports stale skill picker | Auto-republish runs every 60s; force via Command Center

Subscribe at status.claresia.com/subscribe:

  • Email
  • Slack webhook
  • RSS / Atom
  • Generic webhook

Subscribe at the service level (Identity, Distribution, Hub, Telemetry, Command Center, Onboarding Portal, Documentation site) and the region level (eu-south-1 (Milano), eu-central-1 (Frankfurt)) so you only get pings for what affects you.
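
If you route the generic webhook into your own tooling, the same service and region scoping can be applied on your side as a second filter. The sketch below assumes each webhook event is a JSON object with `service` and `region` keys; that payload shape is an assumption made for illustration and is not documented on this page, so check a real delivery before relying on it.

```python
# Example scope: values taken from the service and region lists above.
WATCHED_SERVICES = {"Identity", "Hub"}
WATCHED_REGIONS = {"eu-south-1"}


def is_relevant(event: dict, services: set[str], regions: set[str]) -> bool:
    """True if the event names a watched service in a watched region.

    Assumes hypothetical "service" and "region" keys in the payload;
    events missing either key are treated as not relevant.
    """
    return event.get("service") in services and event.get("region") in regions
```

Treating events with missing keys as irrelevant keeps the filter quiet by default; a stricter setup might instead forward unrecognized payloads so that a format change does not silently hide a notification.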

In Mode C BYOC:

  • The data plane runs in your cloud → your team owns the runbook for Postgres / SharePoint / Snowflake outages
  • The control plane still runs in Claresia Cloud → standard Sev 1-4 process
  • Cross-cutting incidents (e.g., mTLS endpoint cert expired) are jointly triaged