Agent Sandbox
v1alpha1 to v1beta1 Migration Guide
This document covers the operational side of migrating Sandbox, SandboxClaim, SandboxTemplate, and SandboxWarmPool resources from the v1alpha1 API to the v1beta1 API.
If you install the chart fresh with the v1beta1-storage version, there is nothing to migrate — read this only when upgrading an existing installation that holds v1alpha1-serialized resources in etcd.
What changes between versions
Most CRDs are schema-compatible across the two versions; the migration matters mainly for two reasons:
- SandboxClaim is not field-compatible. v1alpha1 has spec.sandboxTemplateRef plus an optional spec.warmpool string policy ("none" / "default" / a specific pool name). v1beta1 requires spec.warmPoolRef.name. The conversion webhook (in extensions/api/v1alpha1/sandboxclaim_conversion.go) handles the rewrite via three branches:
  - Specific pool name (warmpool: my-pool) → webhook uses that name verbatim. If the pool doesn’t exist, the converted claim points at a missing pool, and the operator must create it.
  - "" / "default", warm-started (claim has a bound Sandbox whose name differs from the claim’s name) → webhook derives the pool name from the existing Sandbox via stripRandomSuffix(sandboxName). The source pool already exists; nothing to do at migration time. ("none" never falls into this branch; "none" always cold-starts.)
  - "" / "none" / "default", cold-start (no bound Sandbox, or Sandbox.name == claim.name) → webhook redirects to shadow-pool-<template-name>. The bootstrap phase ensures one such shadow pool exists per (namespace, template) combination.
- Sandbox.spec.replicas becomes Sandbox.spec.operatingMode: replicas: 0 → Suspended, replicas: 1 (or unset) → Running. The webhook handles this automatically.
The other two CRDs (SandboxTemplate, SandboxWarmPool) are structurally identical between versions but still need a storage rewrite so etcd holds them in v1beta1 form.
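The branch selection described above can be rehearsed offline before touching a cluster. The following is a minimal sketch, not the webhook's code: the function names are hypothetical, and strip_random_suffix assumes the random suffix is the final "-<segment>" of the Sandbox name, which may not match the real stripRandomSuffix in sandboxclaim_conversion.go.

```shell
# ASSUMPTION: the random suffix is the last "-<segment>" of the Sandbox name.
# The authoritative logic is stripRandomSuffix in sandboxclaim_conversion.go.
strip_random_suffix() {
  printf '%s\n' "${1%-*}"
}

# predict_warm_pool_ref WARMPOOL TEMPLATE CLAIM_NAME SANDBOX_NAME
# Prints the warmPoolRef.name this guide says the conversion would produce.
predict_warm_pool_ref() {
  local warmpool="$1" template="$2" claim="$3" sandbox="$4"
  case "${warmpool}" in
    ""|default)
      if [ -n "${sandbox}" ] && [ "${sandbox}" != "${claim}" ]; then
        strip_random_suffix "${sandbox}"          # warm-started: reuse source pool
      else
        printf 'shadow-pool-%s\n' "${template}"   # cold-start: shadow pool
      fi
      ;;
    none)
      printf 'shadow-pool-%s\n' "${template}"     # "none" always cold-starts
      ;;
    *)
      printf '%s\n' "${warmpool}"                 # specific pool name, verbatim
      ;;
  esac
}

# operating_mode_for_replicas [REPLICAS]
# Maps the v1alpha1 replicas value to the v1beta1 operating mode.
operating_mode_for_replicas() {
  case "${1:-}" in
    0)    echo Suspended ;;
    1|"") echo Running ;;
    *)    echo "replicas=$1 is not covered by this guide" >&2; return 1 ;;
  esac
}
```

For example, `predict_warm_pool_ref "" my-template claim-a claim-a` takes the cold-start branch because the Sandbox name equals the claim name.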
Two phases
The migration script runs in two distinct phases, and their order is fixed: bootstrap before the upgrade, migrate after it. If you have existing cold-start claims, skipping the bootstrap phase will break them as soon as you upgrade to v0.5.0.
Phase 1: --phase=bootstrap (Conditionally Mandatory, Pre-Upgrade)
- Mandatory for v0.5.0: Yes, if you have existing cold-start v1alpha1 claims (claims where spec.warmpool is empty / "none" / "default" AND status.sandbox.name matches the claim name or is empty). Note: The script automatically detects whether cold-start claims exist. If none exist, it safely exits without creating anything, so it is recommended to always run it (or use --dry-run to inspect) rather than skipping it manually.
- Timing: Strictly before upgrading to v0.5.0 (while v1alpha1 is still active).
- What it does: Scans existing v1alpha1 claims and pre-creates shadow-pool-<template> warm pools. The v0.5.0 controller reconciler is written purely against v1beta1.SandboxClaim, which requires a valid spec.warmPoolRef.name. If you do not pre-create the shadow pools, the conversion webhook will point converted claims to non-existent infrastructure, leaving converted claims stuck with a WarmPoolNotFound condition while the controller repeatedly requeues them.
- Why it cannot run after upgrade: Bootstrap relies on v1alpha1 field inspection. Once v0.5.0 is installed, kubectl get sandboxclaims defaults to returning v1beta1 objects (which lack spec.sandboxTemplateRef). The script will see empty template names, log errors for every claim, and fail to create any shadow pools.
Phase 2: --phase=migrate (Optional for v0.5.0, Post-Upgrade)
- Optional for v0.5.0 (but mandatory before upgrading to a future release that drops v1alpha1).
- Timing: Strictly after upgrading to v0.5.0 (when v1beta1 is established as the storage version and the conversion webhook is live).
- What it does: Patches every existing resource with a benign annotation (agents.x-k8s.io/storage-migrated-at). This forces the API server to read the v1alpha1 etcd record, pass it through the conversion webhook, and rewrite it to etcd in v1beta1 storage format. While the kube-apiserver can translate older records on the fly for v0.5.0, running this phase ensures all objects are re-serialized in v1beta1 format in etcd, which is required before v1alpha1 can be safely removed from the CRD definition in a future release.
- Why it cannot run before upgrade: Before upgrading, v1alpha1 is still the storage version in etcd. Patching the resources will merely write an annotation onto v1alpha1 etcd records, accomplishing zero storage migration.
Both phases are idempotent — safe to re-run.
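Because each phase is tied to the current storage version, it can help to check that version before choosing a phase. The sketch below is an illustration only: both function names are hypothetical, and it assumes the SandboxClaim CRD's storage version is representative of the other three CRDs.

```shell
# current_storage_version CRD
# Prints the name of the version marked "storage: true" on the CRD.
current_storage_version() {
  kubectl get crd "$1" \
    -o jsonpath='{.spec.versions[?(@.storage==true)].name}'
}

# allowed_phase STORAGE_VERSION
# Prints the migrate.sh phase that matches the given storage version.
allowed_phase() {
  case "$1" in
    v1alpha1) echo bootstrap ;;   # pre-upgrade: shadow pools can still be derived
    v1beta1)  echo migrate ;;     # post-upgrade: storage rewrite is meaningful
    *)        echo "unknown storage version: $1" >&2; return 1 ;;
  esac
}

# Example (requires cluster access):
#   allowed_phase "$(current_storage_version sandboxclaims.extensions.agents.x-k8s.io)"
```

If the reported phase differs from the one you were about to run, stop and re-check where you are in the upgrade.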
Before you start: back up your data
Before running either phase, dump every CR the migration will touch so you have a known-good snapshot to fall back to if anything goes wrong:
kubectl get sandboxes,sandboxclaims,sandboxtemplates,sandboxwarmpools \
-A -o yaml > agent-sandbox-backup-$(date -u +%Y%m%dT%H%M%SZ).yaml
Keep the file somewhere durable (not on a worker pod that may get rescheduled). Useful for:
- Inspecting the original v1alpha1 shape if a converted v1beta1 record looks wrong.
- Comparing pre- vs post-migration to confirm only the expected fields changed.
- Re-creating individual mangled resources by hand without restoring the whole namespace.
See Recovery from backup in the Troubleshooting section if you need to roll back.
Migration flows
Pick one of two flows depending on how you manage installs.
Flow A — Manual via kubectl (default)
The official agent-sandbox installation path is kubectl apply -f against the release manifests (see the project README and release notes), so this is the default migration flow. Run the script directly from dev/tools/migrate.sh (a thin wrapper around helm/files/migrate.sh):
# 1. Pre-create the shadow pools BEFORE applying the new CRDs.
# Operates on v1alpha1 - this is the last step that does.
bash dev/tools/migrate.sh --phase=bootstrap
# 2. Install the new controller + CRDs (which include the conversion webhook).
# The release ships two manifests: manifest.yaml (core controller + base
# CRDs + webhook Service) and extensions.yaml (the extensions API group
# CRDs: SandboxClaim, SandboxTemplate, SandboxWarmPool). Apply both.
# Wait until the controller pod is Ready and the webhook Service has
# endpoints before proceeding.
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.5.0/manifest.yaml
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.5.0/extensions.yaml
kubectl rollout status deploy/agent-sandbox-controller -n agent-sandbox-system
kubectl wait --for=condition=Ready pods -l app=agent-sandbox-controller -n agent-sandbox-system
# Wait until the conversion webhook is responsive (this may take a few seconds after the pod starts)
until kubectl get sandboxwarmpools.extensions.agents.x-k8s.io -A >/dev/null 2>&1; do
  echo "Waiting for conversion webhook to be responsive..."
  sleep 2
done
# 3. Force-rewrite every resource in v1beta1 storage format.
bash dev/tools/migrate.sh --phase=migrate
If the cluster is large, scope the rewrite to one namespace at a time:
bash dev/tools/migrate.sh --phase=migrate --namespace=team-alpha
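For several namespaces, the per-namespace invocation can be wrapped in a loop that stops at the first failure. This is a minimal sketch: migrate_namespaces and the MIGRATE_CMD override are hypothetical names introduced here, and only the --phase and --namespace flags shown above are assumed to exist.

```shell
# MIGRATE_CMD is a hypothetical override so the loop can be rehearsed
# (for example with MIGRATE_CMD="echo") before running it for real.
MIGRATE_CMD="${MIGRATE_CMD:-bash dev/tools/migrate.sh}"

# migrate_namespaces NAMESPACE...
# Runs the migrate phase once per namespace, in the order given.
migrate_namespaces() {
  local ns
  for ns in "$@"; do
    echo "migrating namespace: ${ns}"
    # Intentionally unquoted so a multi-word command splits into words.
    ${MIGRATE_CMD} --phase=migrate --namespace="${ns}" || {
      echo "migrate failed in ${ns}; fix the cause and re-run" >&2
      return 1
    }
  done
}

# Example: migrate_namespaces team-alpha team-beta
```

Since the phase is idempotent, re-running the loop after a failure simply revisits namespaces that were already migrated.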
Flow B — Helm-managed, manual script
For installs managed by the Helm chart, the migration is driven manually by the operator using dev/tools/migrate.sh.
# 1. Pre-create shadow pools while v1alpha1 is still the storage version.
bash dev/tools/migrate.sh --phase=bootstrap --dry-run # inspect first
bash dev/tools/migrate.sh --phase=bootstrap
# 2. Manually apply the upgraded CRD manifests using Server-Side Apply.
# Since Helm does not upgrade CRDs on upgrade, they must be applied manually:
kubectl apply --server-side --force-conflicts -f path/to/chart/crds/
# 3. Upgrade the chart.
# (If you are using extension resources like claims, templates, or pools, make sure --set controller.extensions=true is set or enabled in values)
helm upgrade agent-sandbox ./helm/ \
--namespace agent-sandbox-system \
--reuse-values \
--set image.tag=<new-version> \
--set controller.extensions=true
# 4. Wait for the new controller + webhook to be Ready, then rewrite storage.
kubectl rollout status deploy/agent-sandbox-controller -n agent-sandbox-system
# Wait until the conversion webhook is responsive (this may take a few seconds after the pod starts)
until kubectl get sandboxwarmpools.extensions.agents.x-k8s.io -A >/dev/null 2>&1; do
  echo "Waiting for conversion webhook to be responsive..."
  sleep 2
done
bash dev/tools/migrate.sh --phase=migrate
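The webhook readiness loops in both flows poll indefinitely. In automation, a bounded variant may be preferable so a misconfigured webhook fails the run instead of hanging it. The helper below is a sketch with a hypothetical name (wait_until); the timeout and interval values are placeholders to tune for your cluster.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Retries COMMAND until it succeeds; gives up once TIMEOUT_SECONDS have elapsed.
wait_until() {
  local timeout_s="$1" interval_s="$2" elapsed=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "${elapsed}" -ge "${timeout_s}" ]; then
      echo "timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep "${interval_s}"
    elapsed=$((elapsed + interval_s))
  done
}

# Example (requires cluster access):
#   wait_until 120 2 kubectl get sandboxwarmpools.extensions.agents.x-k8s.io -A \
#     && bash dev/tools/migrate.sh --phase=migrate
```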
Dry-runs
Both phases support --dry-run. The script prints what it would do without writing anything:
bash dev/tools/migrate.sh --phase=bootstrap --dry-run
bash dev/tools/migrate.sh --phase=migrate --dry-run
The bootstrap dry-run also prints the “operator action required” summary (claims referencing missing specific pools), which is useful to inspect even when you intend to apply.
After migration completes
Shadow pools
The bootstrap phase creates one shadow-pool-<template> per (namespace, template) combination referenced by cold-start v1alpha1 claims. They’re marked with two annotations:
- agents.x-k8s.io/migration-shadow: "true"
- agents.x-k8s.io/migration-source-template: <template-name>
List them:
kubectl get sandboxwarmpools -A -o json \
| jq -r '.items[]
| select(.metadata.annotations["agents.x-k8s.io/migration-shadow"]=="true")
| "\(.metadata.namespace)/\(.metadata.name) (for template: \(.metadata.annotations["agents.x-k8s.io/migration-source-template"]))"'
Do not delete these pools while any v1beta1 SandboxClaim still references them via warmPoolRef. While v1alpha1 definitions remain in the codebase for webhook conversion, the v1beta1 controller reconciler has no v1alpha1 fallback logic for claims pointing to missing pools. Once you’ve manually re-pointed any remaining claims to real warm pools, the shadow pools can be cleaned up.
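Before cleaning up, it is worth listing which claims still reference a shadow pool. The filter below is a sketch: the function name is hypothetical, and it matches on the shadow-pool- name prefix rather than the migration-shadow annotation, so a hand-made pool that happens to use that prefix would also be reported.

```shell
# claims_on_shadow_pools
# Reads a SandboxClaim list (JSON) on stdin and prints every claim whose
# spec.warmPoolRef.name starts with "shadow-pool-".
claims_on_shadow_pools() {
  jq -r '.items[]
    | select((.spec.warmPoolRef.name // "") | startswith("shadow-pool-"))
    | "\(.metadata.namespace)/\(.metadata.name) -> \(.spec.warmPoolRef.name)"'
}

# Example (requires cluster access):
#   kubectl get sandboxclaims -A -o json | claims_on_shadow_pools
```

An empty result suggests no claim depends on a shadow pool any longer.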
Re-pointing warm-started claims
The bootstrap phase intentionally skips warm-started v1alpha1 claims (those with warmpool: "" / "default" AND a bound Sandbox whose name differs from the claim’s). The webhook redirects those claims’ warmPoolRef to the pool that produced their current Sandbox (via stripRandomSuffix(sandboxName)), so they end up pointing at a real, existing pool and need no shadow pool.
That said, after migration completes you may want to re-point such claims at a different pool (e.g., consolidate, or move to a shadow). The warmPoolRef.name is editable on the v1beta1 claim:
kubectl patch sandboxclaim <name> -n <ns> --type=merge \
-p '{"spec":{"warmPoolRef":{"name":"my-preferred-pool"}}}'
Operator-action items from the bootstrap summary
If bootstrap printed an OPERATOR ACTION REQUIRED section listing claims that reference specific pools which don’t currently exist, the conversion webhook will still rewrite those claims to point at those exact (missing) pool names. To make those claims work, either:
- Create the missing pools manually, OR
- Re-point the claims to existing pools via the kubectl patch above.
Verifying the migration worked
After the migrate phase completes:
# Every resource should now have the storage-migrated-at annotation.
# jq handles annotation keys with "." and "/" correctly; kubectl jsonpath
# dot-escaping cannot reliably read keys containing "/".
kubectl get sandboxes,sandboxclaims,sandboxtemplates,sandboxwarmpools -A -o json \
| jq -r '.items[]
| "\(.kind) \(.metadata.namespace)/\(.metadata.name) -> \(.metadata.annotations["agents.x-k8s.io/storage-migrated-at"] // "<missing>")"'
To verify the actual etcd storage version, check each CRD’s status.storedVersions. The kube-apiserver records there every version that has ever been used to persist a record and never removes entries on its own, so v1alpha1 stays listed even after the migrate phase has rewritten every resource. List the current values:
for crd in \
sandboxes.agents.x-k8s.io \
sandboxclaims.extensions.agents.x-k8s.io \
sandboxtemplates.extensions.agents.x-k8s.io \
sandboxwarmpools.extensions.agents.x-k8s.io; do
  printf '%s: ' "${crd}"
  kubectl get crd "${crd}" -o jsonpath='{.status.storedVersions}'
  printf '\n'
done
If a CRD still lists ["v1alpha1","v1beta1"] after the migrate phase succeeded, every existing record has been rewritten in v1beta1 form, but the storedVersions array is not auto-pruned. To finalize:
# Confirm no v1alpha1-only records remain, then prune storedVersions.
kubectl patch crd <crd-name> --subresource=status --type=merge \
-p '{"status":{"storedVersions":["v1beta1"]}}'
Only do this after you’ve confirmed every existing record carries agents.x-k8s.io/storage-migrated-at from the migrate phase’s run.
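That confirmation can be reduced to a single number: how many resources lack the annotation. The following sketch uses a hypothetical function name and assumes the same list JSON shape as the verification command earlier in this section.

```shell
# count_unmigrated
# Reads a resource list (JSON) on stdin and prints how many items lack the
# agents.x-k8s.io/storage-migrated-at annotation.
count_unmigrated() {
  jq '[.items[]
       | select(.metadata.annotations["agents.x-k8s.io/storage-migrated-at"] == null)]
      | length'
}

# Example (requires cluster access); prune storedVersions only when this prints 0:
#   kubectl get sandboxes,sandboxclaims,sandboxtemplates,sandboxwarmpools -A -o json \
#     | count_unmigrated
```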
Troubleshooting
Migrate phase reports failures on specific resources: re-run the script (bash dev/tools/migrate.sh --phase=migrate). It’s idempotent — already-migrated resources just get the annotation timestamp updated. If a specific resource keeps failing, fetch it (kubectl get -o yaml) and inspect what’s wrong — usually it’s a conversion-webhook error tied to a bad field combination that needs manual cleanup.
Bootstrap printed OPERATOR ACTION REQUIRED for some claims: those claims reference specific pool names that don’t currently exist. The conversion webhook will still rewrite them to point at those names — you must create the pools manually post-migration, or re-point the claims (see “Re-pointing warm-started claims” above).
Webhook connection timeouts in managed/private clusters (e.g., GKE): If you see dial tcp ... connect: connection refused or connection timeouts from the API server during the migrate phase, it is likely that the control plane VPC cannot reach the webhook target port (9443) on the worker nodes.
- By default, GKE private clusters block master-to-worker node traffic on ports other than standard ones like 443 and 10250.
- Fix: Create a firewall rule in your GCP console allowing ingress from your GKE master node IP range to your worker nodes on TCP port 9443.
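If you prefer the gcloud CLI to the console, the rule can be expressed as below. This is a sketch that only prints the command for review. The rule name allow-agent-sandbox-webhook and the function name are hypothetical, and the network, control-plane CIDR, and node tag are values you must look up for your own cluster.

```shell
# print_webhook_firewall_rule NETWORK CONTROL_PLANE_CIDR NODE_TAG
# Prints (does not run) a gcloud command that allows the control plane to
# reach the webhook port on the nodes.
print_webhook_firewall_rule() {
  local network="$1" control_plane_cidr="$2" node_tag="$3"
  cat <<EOF
gcloud compute firewall-rules create allow-agent-sandbox-webhook \\
  --network=${network} \\
  --direction=INGRESS \\
  --action=ALLOW \\
  --rules=tcp:9443 \\
  --source-ranges=${control_plane_cidr} \\
  --target-tags=${node_tag}
EOF
}

# Example: print_webhook_firewall_rule my-vpc 172.16.0.0/28 my-gke-node-tag
```

Review the printed command against your project's firewall conventions before running it.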
Emergency Rollback Procedure (Reverting to v1alpha1)
If the migration fails critically (e.g., the new controller fails to start, the webhook causes severe issues, or you encounter unresolvable errors) and you need to completely revert to the v1alpha1 version:
Step 1: Disable Conversion Webhooks
First, stop the API server from attempting version conversion to prevent blockages on custom resource writes and deletions:
for crd in \
sandboxes.agents.x-k8s.io \
sandboxclaims.extensions.agents.x-k8s.io \
sandboxtemplates.extensions.agents.x-k8s.io \
sandboxwarmpools.extensions.agents.x-k8s.io; do
  kubectl patch crd "${crd}" --type=merge -p '{"spec":{"conversion":{"strategy":"None","webhook":null}}}'
done
Step 2: Scale down the controller deployment
Scale down the agent-sandbox-controller deployment to 0 replicas, and wait for the pods to terminate completely. This stops the controller manager from reconciling resources or creating new Sandboxes to replace deleted ones while you are cleaning up the resources:
kubectl scale deploy/agent-sandbox-controller -n agent-sandbox-system --replicas=0
kubectl wait --for=delete pod -l app=agent-sandbox-controller -n agent-sandbox-system --timeout=60s
Step 3: Delete upgraded resources
While the upgraded CRDs (supporting both v1alpha1 and v1beta1 versions) are still installed, delete all custom resources so etcd is completely emptied of v1beta1 records:
kubectl delete sandboxes,sandboxclaims,sandboxtemplates,sandboxwarmpools -A --all
Step 4: Delete shadow pools (optional)
If the bootstrap phase created shadow warm pools, delete them:
kubectl get sandboxwarmpools -A -o json \
| jq -r '.items[] | select(.metadata.annotations["agents.x-k8s.io/migration-shadow"]=="true") | "\(.metadata.namespace)/\(.metadata.name)"' \
| xargs -I {} sh -c 'kubectl delete sandboxwarmpool $(echo {} | cut -d/ -f2) -n $(echo {} | cut -d/ -f1)'
Step 5: Reset CRD storedVersions to v1alpha1
Because the API server enforces that any version in status.storedVersions must be present in spec.versions, you must patch the CRDs to list only v1alpha1 in their stored versions before downgrading the CRD definitions:
for crd in \
sandboxes.agents.x-k8s.io \
sandboxclaims.extensions.agents.x-k8s.io \
sandboxtemplates.extensions.agents.x-k8s.io \
sandboxwarmpools.extensions.agents.x-k8s.io; do
  kubectl patch crd "${crd}" --subresource=status --type=merge -p '{"status":{"storedVersions":["v1alpha1"]}}'
done
Step 6: Revert the CRD manifests and Controller
Downgrade the installed components back to the old version:
- For Flow A (kubectl): Re-apply the old version’s manifests (substitute <old-version> with your previous version, e.g., v0.4.6):
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/<old-version>/manifest.yaml
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/<old-version>/extensions.yaml
- For Flow B (Helm): Roll back the Helm release to the pre-migration revision (find the revision number using helm history agent-sandbox -n agent-sandbox-system):
helm rollback agent-sandbox <previous-revision-number> -n agent-sandbox-system
# Re-apply the old CRD versions manually:
kubectl apply --server-side --force-conflicts -f path/to/old-chart/crds/
Step 7: Restore Data from Backup
Apply the backup file to restore your original v1alpha1 resources:
# Re-apply the backup (stripping status fields is recommended to allow the old controller to re-initialize them)
yq 'del(.items[].status)' backup.yaml | kubectl apply -f -
Recovery from Backup (Remaining on v1beta1)
If you intend to stay on v1beta1 but the migration produced broken or unexpected v1beta1 resources, restore the affected objects from the backup file created in Before you start: back up your data.
Per-resource restore (preferred — only touches what’s actually broken):
# Inspect a specific resource against the backup to confirm it's wrong.
# In the yq filter, <kind> is the manifest Kind (e.g. SandboxClaim); the
# namespace match keeps same-named resources in other namespaces out.
kubectl get <kind> <name> -n <namespace> -o yaml \
| diff - <(yq '.items[] | select(.kind=="<kind>" and .metadata.namespace=="<namespace>" and .metadata.name=="<name>")' backup.yaml)
# Delete the broken record and re-apply the v1alpha1 spec from the backup.
# The conversion webhook re-converts it on apply.
kubectl delete <kind> <name> -n <namespace>
yq '.items[] | select(.kind=="<kind>" and .metadata.namespace=="<namespace>" and .metadata.name=="<name>")' backup.yaml \
| kubectl apply -f -
Bulk restore (last resort — only when many resources are broken AND the conversion webhook is functioning):
# CAUTION: deletes every Sandbox/SandboxClaim/SandboxTemplate/SandboxWarmPool
# across all namespaces, then re-creates them from the backup.
kubectl delete sandboxes,sandboxclaims,sandboxtemplates,sandboxwarmpools -A --all
kubectl apply -f backup.yaml
Caveats:
- Restoration depends on a functioning conversion webhook. If the webhook itself is broken, fix that first (typically: roll the controller image back to the pre-migration version, then re-apply the backup), or restore in two phases by first re-installing the old chart and then re-applying the backup against the old CRDs.
- The backup captures status subresources too. Strip them before re-apply so the controllers re-derive status from spec rather than racing your stale snapshot: yq 'del(.items[].status)' backup.yaml | kubectl apply -f -.
- Backups don’t capture cluster-scoped state like SandboxWarmPool controller progress; freshly-applied pools will repopulate themselves from the template.