Incident response
This page is for IT admins at customer sites. It defines:
- How to recognize a Claresia incident
- How to declare one
- What Claresia commits to (per Mode A/B/C SLA)
- The post-incident review (PIR) cadence
For Claresia internal IR runbooks (engineering on-call), see the internal runbooks repo (not in this docs site).
How to recognize a Claresia incident
Three signals:
- `status.claresia.com` is yellow or red on a service relevant to you
- Command Center health-check for your tenant fails (see Step 10 verification list in Onboarding runbook)
- End users report skill failures, missing skills in their LLM picker, or stale data
If you see any of these, don’t spend time troubleshooting first: declare the incident to Claresia and we’ll investigate in parallel.
Declaring an incident
Channel options
| Severity | Mode A | Mode B | Mode C |
|---|---|---|---|
| Sev 1 (down) | Slack/Teams Connect channel + email support@claresia.com | Same + phone hotline | Same + phone hotline |
| Sev 2 (degraded) | Slack/Teams Connect | Slack/Teams Connect | Slack/Teams Connect |
| Sev 3 (cosmetic) | Email or Hub Browser feedback button | Same | Same |
Severity definitions
- Sev 1 (Down): tenant cannot complete a primary workflow
  - SSO sign-in fails for all users
  - Hub writes failing
  - All LLM connectors offline
  - Command Center inaccessible
- Sev 2 (Degraded): significant function impaired
  - One LLM connector down (others still working)
  - Telemetry pipeline > 30 min behind
  - Hub Browser slow but writes succeed
  - Onboarding Portal’s smoke-test step fails (existing tenants unaffected)
- Sev 3 (Cosmetic): minor function impacted
  - UI rendering issue (e.g., Adaptive Card columns misaligned)
  - Specific skill returns inconsistent format
  - Dashboard slow in one tab
- Sev 4 (Informational): user question or feature request
What to include in your incident message
- Tenant slug
- Number of impacted users (estimate is fine)
- First-noticed time + your local timezone
- Symptoms (what users see)
- Any recent changes on your side (IdP migration, AAD change, LLM platform update, network ACL change)
- Any error codes from Command Center / Hub Browser
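The checklist above can be captured in a small helper so that nothing is forgotten under pressure. This is a minimal sketch, not a Claresia tool: the `IncidentMessage` class, its field names, and the output layout are all hypothetical, and the rendered text is meant to be pasted into your incident channel or email.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentMessage:
    """Hypothetical container for the fields listed in the checklist."""
    tenant_slug: str
    impacted_users: int          # an estimate is fine
    first_noticed: datetime      # must carry your local timezone
    symptoms: str                # what users see
    recent_changes: list[str] = field(default_factory=list)
    error_codes: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Refuse naive datetimes: the checklist asks for time plus timezone.
        if self.first_noticed.tzinfo is None:
            raise ValueError("first_noticed must be timezone-aware")
        noticed = f"{self.first_noticed.isoformat()} ({self.first_noticed.tzname()})"
        return "\n".join([
            f"Tenant slug: {self.tenant_slug}",
            f"Impacted users (estimate): {self.impacted_users}",
            f"First noticed: {noticed}",
            f"Symptoms: {self.symptoms}",
            "Recent changes on our side: "
            + (", ".join(self.recent_changes) or "none known"),
            "Error codes: " + (", ".join(self.error_codes) or "none seen"),
        ])


if __name__ == "__main__":
    # "acme-eu" is a made-up tenant slug used only for illustration.
    cet = timezone(timedelta(hours=1), "CET")
    print(IncidentMessage(
        tenant_slug="acme-eu",
        impacted_users=40,
        first_noticed=datetime(2025, 1, 15, 9, 30, tzinfo=cet),
        symptoms="SSO sign-in fails for all users",
        recent_changes=["IdP migration"],
    ).render())
```

Requiring a timezone-aware timestamp is a deliberate choice here: it avoids ambiguity when your team and the responding engineer are in different regions.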
Claresia response SLA
| Severity and milestone | Mode A | Mode B/C |
|---|---|---|
| Sev 1 — initial response | 30 min | 15 min |
| Sev 1 — full restoration | best-effort | 4h target |
| Sev 2 — initial response | 4h | 1h |
| Sev 2 — full restoration | best-effort | 24h target |
| Sev 3 — initial response | 1 business day | 1 business day |
| Sev 4 — initial response | 3 business days | 1 business day |
Initial response = a human Claresia engineer acknowledges in your channel. Full restoration = service restored to nominal SLO.
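To make the clock-time rows of the table concrete, the sketch below turns them into a lookup that computes when a target falls due. It rests on two assumptions that this page does not state: that the clock starts when you declare the incident, and that the `sla_deadline` helper and its argument names (hypothetical, not a Claresia API) are how you would want to query it. Business-day rows (Sev 3 and Sev 4) are left out because they depend on a working calendar.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Targets transcribed from the table above. None means "best-effort",
# i.e. no committed target for that mode.
SLA_TARGETS = {
    (1, "initial response"): {"A": timedelta(minutes=30), "B/C": timedelta(minutes=15)},
    (1, "full restoration"): {"A": None, "B/C": timedelta(hours=4)},
    (2, "initial response"): {"A": timedelta(hours=4), "B/C": timedelta(hours=1)},
    (2, "full restoration"): {"A": None, "B/C": timedelta(hours=24)},
}


def sla_deadline(declared_at: datetime, severity: int,
                 milestone: str, mode: str) -> Optional[datetime]:
    """Return the time a target falls due, or None for best-effort rows.

    Assumes the clock starts at declaration time (an assumption of this sketch).
    """
    mode = mode.upper()
    if mode not in {"A", "B", "C"}:
        raise ValueError(f"unknown mode: {mode!r}")
    column = "A" if mode == "A" else "B/C"
    try:
        target = SLA_TARGETS[(severity, milestone)][column]
    except KeyError:
        raise ValueError(f"no clock-time target for Sev {severity} {milestone!r}") from None
    return None if target is None else declared_at + target


if __name__ == "__main__":
    declared = datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc)
    print(sla_deadline(declared, 1, "initial response", "B"))
    print(sla_deadline(declared, 1, "full restoration", "A"))
```

Restoration values are targets, not guarantees, so a computed time is best read as "when to expect an update", not as a contractual cutoff.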
Real-time updates
During an active incident:
- Status updates posted to `status.claresia.com` every 30 minutes (or when meaningful change occurs)
- Slack/Teams Connect channel gets the same update, threaded
- Email broadcast on Sev 1 to your `incident-contact` list (set in Command Center → Settings → Contacts)
What we’ll ask you for during a Sev 1
- Confirm you can reproduce
- Confirm tenant slug + impacted user examples
- Browser dev-tools network log (HAR file) for at least one failed request
- Recent `governance_event` records (last 1 hour), pulled from your Hub via the Hub API and shared in your incident channel
We will not ask you for:
- Your IdP credentials
- LLM platform API keys
- End-user passwords
- Hub record content beyond what’s needed to debug
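HAR files exported from browser dev tools often carry session cookies and bearer tokens, which fall under the things we will not ask for. The sketch below shows one possible way to trim a HAR export before sharing it: it keeps only failed requests and blanks common credential-bearing headers. It relies only on the standard HAR 1.2 layout (`log.entries[].request` / `.response`); the function names, the header list, and the choice to treat status 0 or 400 and above as "failed" are assumptions of this example, not a Claresia requirement.

```python
import copy
import json

# Headers that commonly carry credentials (illustrative, not exhaustive).
SENSITIVE_HEADERS = {"authorization", "proxy-authorization",
                     "cookie", "set-cookie", "x-api-key"}


def redact_headers(headers):
    """Return a copy of a HAR header list with sensitive values blanked."""
    redacted = []
    for header in headers:
        if header.get("name", "").lower() in SENSITIVE_HEADERS:
            header = {**header, "value": "REDACTED"}
        redacted.append(header)
    return redacted


def failed_entries_redacted(har: dict) -> dict:
    """Keep only failed requests and strip credentials; input is not modified."""
    out = copy.deepcopy(har)
    kept = []
    for entry in out["log"]["entries"]:
        status = entry["response"].get("status", 0)
        if status != 0 and status < 400:
            continue  # successful request: not needed for the incident
        for side in ("request", "response"):
            entry[side]["headers"] = redact_headers(entry[side].get("headers", []))
            entry[side]["cookies"] = []
        # Request bodies can contain passwords or form fields.
        entry["request"].pop("postData", None)
        kept.append(entry)
    out["log"]["entries"] = kept
    return out


def redact_har_file(src_path: str, dst_path: str) -> None:
    """Read a HAR export, redact it, and write the result to a new file."""
    with open(src_path, encoding="utf-8") as src:
        har = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(failed_entries_redacted(har), dst, indent=2)
```

This does not catch tokens embedded in URLs or in response bodies, so it is still worth skimming the output before posting it to the incident channel.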
Post-Incident Review (PIR)
For every Sev 1 and most Sev 2:
- Within 5 business days of resolution, Claresia delivers a PIR report
- Sections: Summary, Timeline, Root Cause, Impact, Remediation, Lessons
- The PIR is shared in your incident channel + as a downloadable PDF
- Your IT admin + CSM can request a 30-min walk-through
PIRs are also written to your Hub as output records of the skill `gatespic.incident-postmortem`, searchable in Hub Browser.
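If you want to put the 5-business-day PIR window on your own calendar, the arithmetic can be sketched as follows. The sketch assumes a Monday-to-Friday working week, ignores public holidays, and does not count the resolution day itself; none of these details are specified on this page, and the function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Step forward day by day, counting only Monday through Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def pir_due(resolved_on: date) -> date:
    """Latest expected PIR delivery date under the assumptions above."""
    return add_business_days(resolved_on, 5)


if __name__ == "__main__":
    # Resolved on Friday 17 January 2025, so the window ends the following Friday.
    print(pir_due(date(2025, 1, 17)))  # 2025-01-24
```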
Customer-side responsibilities during an incident
- Don’t rotate credentials in the middle of an incident (it makes diagnosis harder); wait for resolution
- Don’t unilaterally republish skills (Distribution Plane is in a known state — additional pushes confuse forensics)
- Do keep the channel open with timestamps + screenshots
- Do provide your impacted-user count updates as the situation evolves
Recurring incident patterns + standing fixes
| Pattern | Standing fix |
|---|---|
| LLM platform audit log delay > 5 min | Wait — usually self-resolves; we don’t auto-page on this |
| AAD admin consent prompt during onboarding | Pre-consent in Azure → Enterprise Applications → User Settings |
| SCIM bearer rotated by IdP without notice | Subscribe to Claresia “credential health” digest |
| End user reports stale skill picker | Auto-republish runs every 60s; force via Command Center |
Status Page subscription
Subscribe at status.claresia.com/subscribe:
- Slack webhook
- RSS / Atom
- Generic webhook
Subscribe at both the service level (Identity, Distribution, Hub, Telemetry, Command Center, Onboarding Portal, Documentation site) and the region level (eu-south-1 Milano, eu-central-1 Frankfurt) so you only get pings for what affects you.
Mode C special considerations
In Mode C BYOC:
- The data plane runs in your cloud → your team owns the runbooks for Postgres / SharePoint / Snowflake outages
- The control plane still runs in Claresia Cloud → standard Sev 1-4 process
- Cross-cutting incidents (e.g., mTLS endpoint cert expired) are jointly triaged