Incident response

This page is for IT admins at customer sites. It defines:

How to recognize a Claresia incident
How to declare one
What Claresia commits to (per Mode A/B/C SLA)
The post-incident review (PIR) cadence

Claresia's internal IR runbooks (for engineering on-call) live in the internal runbooks repo, which is not part of this docs site.

How to recognize a Claresia incident

Three signals:

status.claresia.com shows yellow or red for a service you rely on
The Command Center health check for your tenant fails (see the Step 10 verification list in the Onboarding runbook)
End users report skill failures, missing skills in their LLM picker, or stale data

If you see any of these, don’t spend time troubleshooting alone: declare the incident to Claresia right away so that we can investigate in parallel.

Declaring an incident

Channel options

Severity	Mode A	Mode B	Mode C
Sev 1 (down)	Slack/Teams Connect channel + email `management@claresia.com`	Mode A channels + phone hotline	Mode A channels + phone hotline
Sev 2 (degraded)	Slack/Teams Connect	Slack/Teams Connect	Slack/Teams Connect
Sev 3 (cosmetic)	Email or Hub Browser feedback button	Email or Hub Browser feedback button	Email or Hub Browser feedback button

Severity definitions

Sev 1 (Down) — Tenant cannot complete a primary workflow:
- SSO sign-in fails for all users
- Hub writes failing
- All LLM connectors offline
- Command Center inaccessible
Sev 2 (Degraded) — Significant function impaired:
- One LLM connector down (others still working)
- Telemetry pipeline > 30 min behind
- Hub Browser slow but writes succeed
- Onboarding Portal’s smoke-test step fails (existing tenants unaffected)
Sev 3 (Cosmetic) — Minor function impacted:
- UI rendering issue (e.g., Adaptive Card columns misaligned)
- Specific skill returns inconsistent format
- Dashboard slow in one tab
Sev 4 (Informational) — User question or feature request
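
As a rough illustration, the definitions above can be encoded as a lookup so that a help-desk script suggests a severity before you declare. The symptom labels and the function name are hypothetical, invented for this sketch; they are not identifiers from any Claresia API.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Down: tenant cannot complete a primary workflow
    SEV2 = 2  # Degraded: significant function impaired
    SEV3 = 3  # Cosmetic: minor function impacted
    SEV4 = 4  # Informational: user question or feature request


# Hypothetical symptom labels (assumption), mapped per the definitions above.
SYMPTOM_SEVERITY = {
    "sso_fails_all_users": Severity.SEV1,
    "hub_writes_failing": Severity.SEV1,
    "all_llm_connectors_offline": Severity.SEV1,
    "command_center_inaccessible": Severity.SEV1,
    "one_llm_connector_down": Severity.SEV2,
    "telemetry_over_30_min_behind": Severity.SEV2,
    "hub_browser_slow": Severity.SEV2,
    "onboarding_smoke_test_fails": Severity.SEV2,
    "ui_rendering_issue": Severity.SEV3,
    "skill_inconsistent_format": Severity.SEV3,
    "dashboard_slow_one_tab": Severity.SEV3,
}


def suggest_severity(symptoms):
    """Return the most severe level among the observed symptoms.

    Unknown labels are ignored; with no known symptom the result is Sev 4.
    """
    known = [SYMPTOM_SEVERITY[s] for s in symptoms if s in SYMPTOM_SEVERITY]
    return min(known, default=Severity.SEV4)
```

When several symptoms are present, the sketch picks the most severe one, which matches the intent of declaring at the highest applicable level.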

What to include in your incident message

Tenant slug
Number of impacted users (estimate is fine)
First-noticed time + your local timezone
Symptoms (what users see)
Any recent changes on your side (IdP migration, AAD change, LLM platform update, network ACL change)
Any error codes from Command Center / Hub Browser
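
If your team prefers to prepare the message from a template, the checklist above could be captured roughly as follows. This is a minimal sketch: the class name, field names, and output layout are assumptions for illustration, not a format that Claresia requires.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentReport:
    tenant_slug: str
    impacted_users: int          # an estimate is fine
    first_noticed: datetime      # must be timezone-aware (your local timezone)
    symptoms: str                # what users see
    recent_changes: list = field(default_factory=list)
    error_codes: list = field(default_factory=list)

    def to_message(self) -> str:
        """Render the report as plain text for the incident channel."""
        if self.first_noticed.tzinfo is None:
            raise ValueError("first_noticed must carry a timezone")
        lines = [
            f"Tenant slug: {self.tenant_slug}",
            f"Impacted users (estimate): {self.impacted_users}",
            f"First noticed: {self.first_noticed.isoformat()}",
            f"Symptoms: {self.symptoms}",
            "Recent changes: " + (", ".join(self.recent_changes) or "none known"),
            "Error codes: " + (", ".join(self.error_codes) or "none seen"),
        ]
        return "\n".join(lines)


# Example with made-up values (tenant slug and details are placeholders).
example = IncidentReport(
    tenant_slug="example-tenant",
    impacted_users=40,
    first_noticed=datetime(2025, 1, 15, 9, 30, tzinfo=timezone(timedelta(hours=1))),
    symptoms="Skills missing from the LLM picker",
    recent_changes=["network ACL change"],
)
```

Requiring a timezone-aware timestamp avoids ambiguity when the message is read by an engineer in another region.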

Claresia response SLA

Severity and milestone	Mode A	Mode B/C
Sev 1 — initial response	30 min	15 min
Sev 1 — full restoration	best-effort	4h target
Sev 2 — initial response	4h	1h
Sev 2 — full restoration	best-effort	24h target
Sev 3 — initial response	1 business day	1 business day
Sev 4 — initial response	3 business days	1 business day

Initial response = a human Claresia engineer acknowledges in your channel. Full restoration = service restored to nominal SLO.
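
To track the targets on your side, the Sev 1 and Sev 2 rows of the table can be turned into deadlines as sketched below. The sketch assumes the clock starts when you declare the incident, which this page does not state explicitly; Sev 3 and Sev 4 targets are expressed in business days and are left out. The function and table names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# (severity, tier) -> (initial response, restoration target).
# A restoration value of None means best-effort, as in the table above.
SLA = {
    (1, "A"): (timedelta(minutes=30), None),
    (1, "B/C"): (timedelta(minutes=15), timedelta(hours=4)),
    (2, "A"): (timedelta(hours=4), None),
    (2, "B/C"): (timedelta(hours=1), timedelta(hours=24)),
}


def sla_deadlines(severity, mode, declared_at):
    """Return (respond_by, restore_by) for a Sev 1 or Sev 2 incident.

    restore_by is None when restoration is best-effort.
    Assumption: both clocks start at declared_at.
    """
    mode = mode.upper()
    if mode not in ("A", "B", "C"):
        raise ValueError(f"unknown mode: {mode}")
    tier = "A" if mode == "A" else "B/C"
    if (severity, tier) not in SLA:
        raise ValueError("only Sev 1 and Sev 2 are covered by this sketch")
    respond_in, restore_in = SLA[(severity, tier)]
    respond_by = declared_at + respond_in
    restore_by = declared_at + restore_in if restore_in is not None else None
    return respond_by, restore_by
```

For example, a Sev 1 declared by a Mode C tenant at 09:00 UTC yields a response deadline of 09:15 UTC and a restoration target of 13:00 UTC.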

Real-time updates

During an active incident:

Status updates are posted to status.claresia.com at least every 30 minutes, and sooner when something meaningful changes
Slack/Teams Connect channel gets the same update, threaded
Email broadcast on Sev 1 to your incident-contact list (set in Command Center → Settings → Contacts)

What we’ll ask you for during a Sev 1

Confirm you can reproduce
Confirm tenant slug + impacted user examples
Browser dev-tools network log (HAR file) for at least one failed request
Recent governance_event records (last 1 hour) — pulled from your Hub via the Hub API and shared in your incident channel

We will not ask you for:

Your IdP credentials
LLM platform API keys
End-user passwords
Hub record content beyond what’s needed to debug
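
Because a HAR file records request and response headers and cookies, it can carry session tokens that fall under the list above. One way to strip them before sharing is sketched here. It relies only on the standard HAR layout (log.entries[].request/response with headers and cookies arrays); the function names and the header list are assumptions, and the sketch does not inspect query strings or request bodies, which may also need manual review.

```python
import json

# Header names that commonly carry credentials (illustrative, not exhaustive).
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie", "set-cookie"}
REDACTED = "[REDACTED]"


def redact_har(har):
    """Redact credential-bearing headers and all cookie values, in place."""
    for entry in har.get("log", {}).get("entries", []):
        for side in ("request", "response"):
            message = entry.get(side, {})
            for header in message.get("headers", []):
                if header.get("name", "").lower() in SENSITIVE_HEADERS:
                    header["value"] = REDACTED
            for cookie in message.get("cookies", []):
                cookie["value"] = REDACTED
    return har


def redact_har_file(src_path, dst_path):
    """Read a HAR file, redact it, and write the result to a new file."""
    with open(src_path, encoding="utf-8") as src:
        har = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(redact_har(har), dst, indent=2)
```

Writing to a separate output file keeps the original capture intact in case the redaction removes something that is later needed.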

Post-Incident Review (PIR)

For every Sev 1 and most Sev 2:

Within 5 business days of resolution, Claresia delivers a PIR report
Sections: Summary, Timeline, Root Cause, Impact, Remediation, Lessons
The PIR is shared in your incident channel + as a downloadable PDF
Your IT admin + CSM can request a 30-min walk-through
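
To estimate when a PIR is due, the 5-business-day window can be computed as in this sketch. It assumes business days are Monday to Friday, ignores public holidays, and does not count the resolution day itself; none of these details are specified on this page, and the function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start, days):
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def pir_due(resolved_on):
    """Latest expected PIR delivery date for an incident resolved on `resolved_on`."""
    return add_business_days(resolved_on, 5)
```

Under these assumptions, an incident resolved on a Wednesday has its PIR due by the following Wednesday.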

PIRs are also written to your Hub as output records of the skill gatespic.incident-postmortem, so you can search for them in Hub Browser.

Customer-side responsibilities during an incident

Don’t rotate credentials in the middle of an incident (it makes diagnosis harder); wait for resolution
Don’t unilaterally republish skills (Distribution Plane is in a known state — additional pushes confuse forensics)
Do keep the channel open with timestamps + screenshots
Do provide your impacted-user count updates as the situation evolves

Recurring incident patterns + standing fixes

Pattern	Standing fix
LLM platform audit log delay > 5 min	Wait — usually self-resolves; we don’t auto-page on this
AAD admin consent prompt during onboarding	Pre-consent in Azure → Enterprise Applications → User Settings
SCIM bearer token rotated by IdP without notice	Subscribe to the Claresia “credential health” digest
End user reports stale skill picker	Auto-republish runs every 60s; force via Command Center

Status Page subscription

Subscribe at status.claresia.com/subscribe:

Email
Slack webhook
RSS / Atom
Generic webhook

Subscribe at the service level (Identity, Distribution, Hub, Telemetry, Command Center, Onboarding Portal, Documentation site) and at the region level (eu-south-1 in Milano, eu-central-1 in Frankfurt) so that you are only notified about what affects you.
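
If you use the generic webhook and also want to filter on your side, the same service and region scoping could look roughly like this. The payload shape (a dict with "service" and "region" keys) is an assumption made for the sketch; check the actual webhook payload before relying on it.

```python
# Services and regions your tenant depends on (example values).
MY_SERVICES = {"Identity", "Hub", "Distribution"}
MY_REGIONS = {"eu-south-1"}


def is_relevant(event, services=MY_SERVICES, regions=MY_REGIONS):
    """Return True when a status event matches both a watched service and region.

    `event` is assumed to be a dict with "service" and "region" keys.
    """
    return event.get("service") in services and event.get("region") in regions


def relevant_events(events):
    """Keep only the events that should notify your team."""
    return [e for e in events if is_relevant(e)]
```

Events missing either key are treated as not relevant, so a malformed payload never triggers a notification by accident.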

Mode C special considerations

In Mode C (BYOC):

The data plane runs in your cloud, so your team owns the runbooks for Postgres / SharePoint / Snowflake outages
The control plane still runs in Claresia Cloud, so the standard Sev 1-4 process applies
Cross-cutting incidents (e.g., mTLS endpoint cert expired) are jointly triaged
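
The ownership split above can be summarized as a small routing rule. This is only an illustrative sketch; the plane labels and the function name are hypothetical.

```python
# Who leads triage for each plane in Mode C, per the list above.
OWNER_BY_PLANE = {
    "data": "customer",     # runs in your cloud
    "control": "claresia",  # runs in Claresia Cloud
}


def triage_owner(affected_planes):
    """Return "customer", "claresia", or "joint" for the affected planes."""
    planes = set(affected_planes)
    if not planes:
        raise ValueError("at least one affected plane is required")
    unknown = planes - set(OWNER_BY_PLANE)
    if unknown:
        raise ValueError(f"unknown plane(s): {sorted(unknown)}")
    owners = {OWNER_BY_PLANE[p] for p in planes}
    # A single owner leads; anything spanning both planes is triaged jointly.
    return owners.pop() if len(owners) == 1 else "joint"
```

An expired mTLS endpoint certificate, for instance, touches both planes and would come out as "joint".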