How to add custom agent skills

Agent Skills are short, structured runbooks that the SRI Agent can invoke on demand. Writing one turns your best on-call playbook into something Chat runs for the team the same way every time. This tutorial walks through building and shipping one from scratch.

RubixKube uses the open Agent Skills format pioneered by Anthropic and adopted by Cursor, Claude Code, and GitHub Copilot. A skill written for RubixKube is portable to those tools, and vice versa.

Prerequisites

A RubixKube workspace you can admin.
A runbook or playbook you already use, in any format (Google Doc, Notion, markdown).
About twenty minutes.

What a skill looks like

A skill is a YAML file plus optional prompt instructions. Minimal shape:

name: post-deploy-verification
scope: tenant
agent: sre-agent
description: Compare error rate, latency, and resource usage for a service against the last stable window.
allowed_tools:
  - fetch_infrastructure_snapshot
  - fetch_deployment_history
  - analyze_service_health
instructions: |
  1. Take a baseline from the last 30 minutes of the previous stable deployment.
  2. Take the current window (10 minutes after the new rollout).
  3. Compare error rate, p95 latency, CPU, and memory.
  4. Flag any metric worse than 20% against baseline.
  5. Summarise in a short Slack-ready paragraph.

Fields to know:

Field	Purpose
`name`	Unique handle, kebab-case. Used when invoking with `/skills`.
`scope`	`system` (RubixKube-maintained) or `tenant` (your workspace). Custom skills are always `tenant`.
`agent`	Which agent can run this. Today only the SRI Agent is supported.
`description`	One sentence. Used in the skill catalogue and for intent matching.
`allowed_tools`	Explicit tool allowlist. Anything not listed is blocked.
`instructions`	Plain-English runbook. Numbered steps work best.
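The field rules above can be checked before pasting a skill into the console. The following is a minimal Python sketch, not RubixKube's own validator: the function name `validate_custom_skill`, the kebab-case pattern, and the error messages are assumptions made for illustration, and the skill is represented as a plain dictionary rather than parsed from YAML.

```python
import re

# Field names come from the table above; the checks themselves are an
# illustrative assumption, not the console's actual schema validation.
REQUIRED_FIELDS = ("name", "scope", "agent", "description", "allowed_tools", "instructions")
KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def validate_custom_skill(skill: dict) -> list:
    """Return a list of problems; an empty list means the skill looks well-formed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not skill.get(field):
            problems.append(f"missing or empty field: {field}")

    name = skill.get("name")
    if isinstance(name, str) and name and not KEBAB_CASE.fullmatch(name):
        problems.append("name must be kebab-case")

    # Custom skills are always tenant-scoped; system skills are RubixKube-maintained.
    scope = skill.get("scope")
    if scope and scope != "tenant":
        problems.append("custom skills must use scope: tenant")

    tools = skill.get("allowed_tools")
    if tools and not (isinstance(tools, list) and all(isinstance(t, str) for t in tools)):
        problems.append("allowed_tools must be a list of tool names")

    return problems


skill = {
    "name": "post-deploy-verification",
    "scope": "tenant",
    "agent": "sre-agent",
    "description": "Compare error rate, latency, and resource usage against the last stable window.",
    "allowed_tools": ["fetch_infrastructure_snapshot", "fetch_deployment_history"],
    "instructions": "1. Take a baseline from the last 30 minutes of the previous stable deployment.",
}
print(validate_custom_skill(skill))
print(validate_custom_skill(dict(skill, name="Post Deploy", scope="system")))
```

The first call reports no problems; the second reports both the name and the scope. The console's live validation remains the source of truth for the real schema.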

Step 1: Pick a real runbook

Skills work best when they wrap something your team already does in a specific way. Good candidates:

Post-deploy verification

Compare current metrics to the last stable window after a rollout.

On-call handoff

Summarise open incidents, pending actions, and high-risk changes for the incoming rotation.

Cost anomaly review

Weekly pass to spot services drifting over budget.

Pre-change safety check

Before a risky deploy, run a standard readiness check.

Step 2: Write the skill

Open Skills in the console and click New skill. Paste your YAML into the editor. The console validates the schema live.

Start with instructions that read like a senior engineer’s handover note: short imperative sentences, with specific numbers where they matter (“20% worse than baseline”, not “significantly worse”).
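To see why a concrete threshold is easier to act on than a vague one, here is a minimal Python sketch of the comparison that a "20% worse than baseline" instruction describes. The function name `flag_regressions`, the metric names, and the sample values are all made up for illustration, and the sketch assumes that a higher value is worse for every metric.

```python
def flag_regressions(baseline: dict, current: dict, threshold: float = 0.20) -> dict:
    """Return metrics whose current value exceeds baseline by more than `threshold`.

    Assumes higher is worse for every metric (error rate, latency, CPU, memory).
    """
    flagged = {}
    for metric, base in baseline.items():
        if metric not in current or base <= 0:
            continue  # no data, or no meaningful baseline to compare against
        change = (current[metric] - base) / base
        if change > threshold:
            flagged[metric] = round(change, 3)
    return flagged


# Hypothetical numbers: baseline window versus the window after the rollout.
baseline = {"error_rate": 0.010, "p95_latency_ms": 200.0, "cpu_cores": 1.5, "memory_mib": 512.0}
current = {"error_rate": 0.011, "p95_latency_ms": 260.0, "cpu_cores": 1.5, "memory_mib": 640.0}
print(flag_regressions(baseline, current))
```

With these values, latency (30% worse) and memory (25% worse) are flagged, while the 10% rise in error rate is not. A phrase like "significantly worse" gives the agent no such line to draw.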

Step 3: Pick the right tools

The `allowed_tools` list is the safety boundary. The SRI Agent can only call tools listed here. A few of the most common tools:

Tool	What it does
`fetch_infrastructure_snapshot`	Reads current state of resources across environments
`fetch_kubernetes_graph_snapshot`	Full topology of a connected cluster
`fetch_deployment_history`	Recent rollouts, rollbacks, and config changes
`analyze_service_health`	Aggregated metric view for one or more services
`query_logs`	Read-only log access for a scoped window
`run_linear_issue`	Create a Linear issue with the output

Only add what the skill needs. A narrower `allowed_tools` list means safer, more predictable runs.
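The allowlist behaviour can be pictured as a gate in front of every tool call. This is a rough Python sketch of that idea only: `call_tool`, `ToolNotAllowed`, and the lambda stand-ins for tools are hypothetical and say nothing about how RubixKube enforces the boundary internally.

```python
class ToolNotAllowed(Exception):
    """Raised when a skill tries to call a tool outside its allowlist."""


def call_tool(skill: dict, tool_name: str, registry: dict, **kwargs):
    """Dispatch a tool call only if the skill's allowed_tools lists it."""
    if tool_name not in skill.get("allowed_tools", []):
        raise ToolNotAllowed(f"{tool_name} is not in allowed_tools for {skill['name']}")
    return registry[tool_name](**kwargs)


# Hypothetical stand-ins for real tools; only the names come from the table above.
registry = {
    "fetch_deployment_history": lambda service: f"history for {service}",
    "query_logs": lambda service: f"logs for {service}",
}
skill = {"name": "post-deploy-verification", "allowed_tools": ["fetch_deployment_history"]}

print(call_tool(skill, "fetch_deployment_history", registry, service="checkout"))
try:
    call_tool(skill, "query_logs", registry, service="checkout")
except ToolNotAllowed as err:
    print(f"blocked: {err}")
```

The second call is blocked even though `query_logs` exists, because the skill never listed it. That is the property a short allowlist buys you.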

Step 4: Test before enabling

Click Test in the console. The skill runs against your live environment and renders the output. Iterate on the instructions until the answer is what you want.

A skill that touches mutating tools (create issue, scale deployment) runs inside Guardian policies. Anything truly destructive requires human approval at runtime, regardless of what the skill tells the agent to do.

Step 5: Enable and share

Flip Enabled on the skill. The SRI Agent can now invoke it in two ways:

Intent match. A user asks a question the skill description fits, and the agent picks it up automatically.
Explicit call. A user types `/skills skill-name` in Chat.
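As an illustration of the explicit-call form, the Python sketch below recognises a `/skills` command and extracts the skill name. The exact command grammar is an assumption based on the example above, and `parse_skill_command` is a hypothetical helper, not part of any RubixKube API.

```python
import re

# Assumed command shape: "/skills <kebab-case-name>", with nothing after the name.
SKILL_COMMAND = re.compile(r"/skills\s+([a-z0-9]+(?:-[a-z0-9]+)*)")


def parse_skill_command(message: str):
    """Return the skill name for an explicit call, or None for any other message."""
    match = SKILL_COMMAND.fullmatch(message.strip())
    return match.group(1) if match else None


print(parse_skill_command("/skills on-call-handoff"))
print(parse_skill_command("What changed in the last deploy?"))
```

A `None` result corresponds to the other path: the message is ordinary chat, and the agent decides by intent match whether a skill's description fits.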

For team-critical skills, link the skill from your on-call README so the rotation knows it exists.

Step 6: Watch it learn

Every run feeds the Memory Engine. Common follow-up questions on top of a skill’s output get folded into its context, so the next run is tighter. You can inspect a skill’s run history and adjust the instructions with real traffic as signal.

Examples to copy

On-call handoff (fully worked example)

name: on-call-handoff
scope: tenant
agent: sre-agent
description: Summarise open incidents, pending actions, and risky changes for the next on-call shift.
allowed_tools:
  - fetch_infrastructure_snapshot
  - fetch_deployment_history
  - list_open_incidents
  - list_pending_actions
instructions: |
  1. List every open incident across all environments. Include severity, owner, and age.
  2. List pending actions (awaiting approval) with their RCA link.
  3. List deployments or config changes in the last 24 hours with rollback risk above medium.
  4. Flag any environment with degraded agents.
  5. Produce a six-line Slack message. Use bullet points. No adverbs.
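Step 3 of this skill is the kind of instruction that benefits from being unambiguous. The Python sketch below shows one reading of "in the last 24 hours with rollback risk above medium". The record shape, the `risky_recent_changes` helper, and the three-level risk ordering are assumptions for illustration; the real data comes from the tools listed in the skill.

```python
from datetime import datetime, timedelta, timezone

# Assumed ordering of risk labels; the skill text only says "above medium".
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def risky_recent_changes(changes: list, now: datetime, window_hours: int = 24) -> list:
    """Return services changed within the window whose rollback risk is above medium."""
    cutoff = now - timedelta(hours=window_hours)
    return [
        change["service"]
        for change in changes
        if change["deployed_at"] >= cutoff
        and RISK_ORDER[change["rollback_risk"]] > RISK_ORDER["medium"]
    ]


# Hypothetical change records with an arbitrary fixed "now" so the result is reproducible.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
changes = [
    {"service": "checkout", "deployed_at": now - timedelta(hours=3), "rollback_risk": "high"},
    {"service": "search", "deployed_at": now - timedelta(hours=5), "rollback_risk": "medium"},
    {"service": "billing", "deployed_at": now - timedelta(hours=30), "rollback_risk": "high"},
]
print(risky_recent_changes(changes, now))
```

Only `checkout` qualifies: `search` is recent but only medium risk, and `billing` is high risk but outside the 24-hour window.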

Cost anomaly review (fully worked example)

name: weekly-cost-anomaly
scope: tenant
agent: sre-agent
description: Weekly pass to spot services or resources drifting over their typical cost baseline.
allowed_tools:
  - fetch_infrastructure_snapshot
  - analyze_cost_drift
  - rank_resources_by_spend
instructions: |
  1. Pull cost data for the last 7 days, grouped by service and environment.
  2. Compare against a 28-day rolling baseline.
  3. Return any service more than 15% above baseline.
  4. For each flagged service, suggest the single most likely reason (scale event, traffic shift, config change).
  5. Produce a table for the engineering leads channel.
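The arithmetic behind steps 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how `analyze_cost_drift` works: it treats the baseline as the 28-day total divided by four (an average week), and the function name `flag_cost_drift`, the service names, and the spend figures are invented.

```python
def flag_cost_drift(last_7_days: dict, prior_28_days: dict, threshold: float = 0.15) -> list:
    """Return (service, drift) pairs whose 7-day spend exceeds the weekly baseline by more than `threshold`.

    Assumption: the baseline is the 28-day total divided by four, i.e. an average week.
    """
    flagged = []
    for service, week_spend in last_7_days.items():
        baseline_week = prior_28_days.get(service, 0.0) / 4
        if baseline_week <= 0:
            continue  # new service or no history: nothing to compare against
        drift = (week_spend - baseline_week) / baseline_week
        if drift > threshold:
            flagged.append((service, round(drift, 3)))
    # Largest drift first, so the table leads with the worst offender.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical spend in a single currency unit.
last_7_days = {"search": 1000.0, "checkout": 1300.0, "billing": 400.0}
prior_28_days = {"search": 3200.0, "checkout": 4000.0, "billing": 1600.0}
print(flag_cost_drift(last_7_days, prior_28_days))
```

Here `checkout` (30% above its average week) and `search` (25% above) are returned in that order, and `billing` is left out because it sits exactly on baseline. If your team defines the rolling baseline differently, say so in the instructions so every run uses the same definition.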

Common questions

What is the difference between a system skill and a custom skill?

System skills ship with RubixKube and are maintained by the team. Custom skills live inside your workspace (scope: tenant). Both use the same format.

Can I version skills?

Yes. Every save creates a revision. You can roll back from the skill detail page. Versioning in Git via the RubixKube CLI is on the roadmap.

Do skill runs count against my investigation quota?

A skill run counts as one investigation on your plan. High-frequency skills (like a daily handoff) are usually the best use of the quota.

Can I share skills across workspaces?

Related guides

Agent Skills concepts

The underlying model: why skills matter, how they are scored, where they run.

Talk to your infra

Writing good skills starts with writing good prompts in Chat.
