Troubleshooting#
Actionable troubleshooting reference for every module. Jump to the section that matches your problem.
Cloud Connection Errors#
Azure Connection Errors#
| Error code | Cause | Resolution |
|---|---|---|
| | The client secret has expired | Rotate the Service Principal secret in Azure AD, then update credentials in Tenant → Connections → Update Secret |
| | Invalid scope or app not registered | Verify the Application (Client) ID is correct; ensure the Service Principal is registered in the correct tenant |
| | The Client ID or Client Secret is incorrect | Re-check credentials in Tenant → Connections → Update Secret |
| | User has not consented to the required API permissions | Grant admin consent to the Reply CMP app registration in Azure AD |
| | The service principal lacks the required role on the subscription | Assign the |
AWS Connection Errors#
| Error code | Cause | Resolution |
|---|---|---|
| | The Access Key ID does not exist or has been deleted | Generate a new access key in AWS IAM and update via Tenant → Connections |
| | The Secret Access Key is incorrect | Verify the Secret Access Key and update via Tenant → Connections → Update Secret |
| | The AWS account is suspended or the key is invalid | Verify the AWS account status and key validity in the AWS console |
| | Insufficient IAM permissions | Ensure the IAM user/role has the required read policies; see Connect a Provider for the IAM policy template |
GCP Connection Errors#
| Error code | Cause | Resolution |
|---|---|---|
| | The service account key is invalid, expired, or revoked | Generate a new service account key in GCP IAM and update via Tenant → Connections |
| | The service account lacks sufficient project permissions | Assign |
| | The GCP project ID in the connection configuration is incorrect | Verify the project ID in Tenant → Connections → Open Details |
Discovery Errors#
Resources not appearing:
Check that the connection completed a successful sync: Tenant → Connections → Last sync status
Verify the resource type is in the supported coverage list: Supported Services
Check the cloud credentials have read permission for the resource type
Trigger a manual sync: Tenant → Connections → Launch Discovery
Discovery sync stuck or running for > 30 minutes:
This may indicate a credentials issue or a provider API throttle. Check the Audit Logs (Tenant → Auditing → Discovery tab) for error events. If no error is shown, contact your platform administrator.
“Not provisioned via CMP” label:
This is informational — it means the resource exists but was not created via the Reply CMP provisioning wizard. It is not an error.
FinOps Errors#
€0 / blank cost for a resource you know has spend:
Verify the connection has completed at least one cost ingestion (different from discovery sync): Tenant → Connections → Last cost refresh
Cost data is T-1 (previous day). Resources provisioned today will show €0 until tomorrow.
Some commitment types (Reserved Instances, Savings Plans) result in €0 marginal cost on covered resources — the cost appears on the reservation commitment line instead.
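The T-1 rule above can be checked with simple date arithmetic. This is a minimal sketch, assuming the daily cost ingestion has already completed; the function names are hypothetical and not part of Reply CMP.

```python
from datetime import date, timedelta


def latest_cost_date(today: date) -> date:
    """Most recent day for which T-1 cost data can exist."""
    return today - timedelta(days=1)


def cost_expected(provisioned_on: date, today: date) -> bool:
    """True if the resource is old enough to appear in T-1 cost data."""
    return provisioned_on <= latest_cost_date(today)


# A resource provisioned today cannot have cost data yet.
print(cost_expected(date(2024, 3, 1), today=date(2024, 3, 1)))
# A resource provisioned yesterday can.
print(cost_expected(date(2024, 2, 29), today=date(2024, 3, 1)))
```

If `cost_expected` is true and the resource still shows €0, the remaining causes listed above (no completed cost ingestion, or commitment coverage) are the more likely explanation.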
“Missing data” or gaps in cost charts:
Verify the billing period was active for that connection.
Some providers have billing data gaps for very new resources (< 24h old).
Cost export APIs occasionally have delays — they typically resolve within 24–48h.
Cannot create a budget:
Requires the FinOps.Budget / Write permission; check your role.
Each Group node can have only one budget; check whether one already exists.
Forecast toggle is greyed out:
The forecast toggle requires Actual cost type and does not work with Amortised. Switch Cost Type to Actual in the Analyze or Assess view.
Provisioning Errors#
Deployment failed — “insufficient permissions”:
The cloud connection used for the deployment does not have enough permissions (typically Contributor is needed for resource creation). Verify the service principal’s role on the subscription/project in the cloud provider console.
Resource name already in use:
Some resource types require globally unique names (e.g., Azure Storage Accounts, AWS S3 buckets). Try a different name with a unique suffix.
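One way to produce a name with a unique suffix is sketched below for Azure Storage Accounts, whose names must be 3 to 24 characters long and contain only lowercase letters and digits. The helper name and the default suffix length are assumptions made for illustration; the provisioning wizard may apply its own naming rules.

```python
import re
import secrets


def unique_storage_account_name(prefix: str, suffix_len: int = 6) -> str:
    """Build a storage-account-style name: lowercase alphanumerics, max 24 chars."""
    if not 3 <= suffix_len <= 24:
        raise ValueError("suffix_len must be between 3 and 24")
    # Drop every character that is not a lowercase letter or digit.
    base = re.sub(r"[^a-z0-9]", "", prefix.lower())
    # token_hex(n) returns 2*n lowercase hex characters; keep suffix_len of them.
    suffix = secrets.token_hex(suffix_len)[:suffix_len]
    # Truncate the prefix so prefix + suffix never exceeds 24 characters.
    return base[: 24 - suffix_len] + suffix


print(unique_storage_account_name("My-App_Data"))
```

A random suffix makes a collision unlikely, not impossible; if the provider still reports the name as taken, generate a new one.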
Deployment stuck in “Deploying” state for > 15 minutes:
This may indicate a provider-side timeout. Check the raw Terraform output for the last error. If the Terraform process was interrupted, the deployment may need manual cleanup. Contact your platform administrator.
“AI Plan Summary not available”:
Occurs when Azure OpenAI is unavailable or rate-limited. The raw Terraform output is always available as a fallback. Retry the dry-run after a few minutes.
Automation Errors#
Policy did not fire at scheduled time:
Verify the connection used has active credentials: Tenant → Connections, and check for a red expiry chip.
Check the Execution History tab on the policy for error messages from the last run.
Remember that schedules are in UTC; verify that the cron expression produces the expected UTC time.
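The UTC conversion can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard five-field cron layout (minute, hour, day of month, month, day of week) and a fixed UTC offset; the function name is hypothetical, and the source does not state which cron dialect the policy editor accepts.

```python
def daily_cron_utc(local_hour: int, local_minute: int, utc_offset_minutes: int) -> str:
    """Return a daily cron expression in UTC for a given local wall-clock time.

    utc_offset_minutes is the local zone's offset from UTC (e.g. 120 for UTC+2).
    """
    local_total = local_hour * 60 + local_minute
    # Subtract the offset and wrap around midnight.
    utc_total = (local_total - utc_offset_minutes) % (24 * 60)
    return f"{utc_total % 60} {utc_total // 60} * * *"


print(daily_cron_utc(8, 0, 120))   # 08:00 at UTC+2 -> "0 6 * * *"
print(daily_cron_utc(0, 30, 60))   # 00:30 at UTC+1 -> "30 23 * * *"
```

Two caveats apply to this approach: when the conversion crosses midnight (as in the second call), any day-of-week field must shift by one day as well, and a fixed offset does not follow daylight saving changes, so the expression may need adjusting twice a year.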
Resources showing “Failed” in execution history:
The cloud connection may lack the required write permissions (e.g., Microsoft.Compute/virtualMachines/start/action for Azure).
The resource may be in a state that prevents the action (e.g., a VM being resized cannot be started or stopped).
GKE Autopilot Stop policy not working:
GKE Autopilot clusters are not supported for Stop policies. They manage their own node lifecycle. Use Standard GKE clusters if automation scheduling is required.
Monitoring Errors#
Widget shows “Query error”:
Verify the connection is active: Administration → Connections — check for a red expiry chip
Use the “Test query” button in the widget editor to surface the raw API error
Check provider-side permissions:
Azure: the Reader role on the subscription covers Metrics and Alerts Management. For Log Analytics (KQL), if the workspace uses workspace-centric access mode, additionally assign Log Analytics Reader on that workspace.
AWS: ReadOnlyAccess already includes all permissions for CloudWatch Metrics, Alarms, and Logs Insights. No extra policies needed.
GCP: roles/viewer includes all roles/monitoring.viewer permissions. No extra roles needed.
Widget shows no data / flat chart:
Widen the time range — some metrics only emit every 5–15 minutes
For Azure Metrics: metric name and resource type must match exactly (case-sensitive)
For Log Analytics: verify the workspace contains data for the selected time range
Dashboard not loading after import:
Confirm the .welkin-dashboard file was exported from v1.1.0 or later (format version "1")
All widget IDs are regenerated on import; connection references are preserved by connection ID, so ensure the same connections exist in the target tenant
Alert rule stays in “Error” state:
Open the rule → History tab — the error message shows the raw failure reason
Common causes: expired credentials, query references a deleted resource or workspace, KQL syntax error
Use “Evaluate now” after fixing configuration to confirm the error clears
Alert rule fires immediately after creation:
Expected — if the threshold condition is already met at first evaluation, the rule fires on first run by design.
Alert rule shows “No Data”:
Widen the Evaluation window — some metrics are sparse (5–10 minute intervals)
For the AnyResult operator with Azure Alerts: No Data means no active alerts were returned, not a query failure
Alert email notifications not received:
Confirm Email notification is enabled on the rule and the recipient address is correct
Check your spam/junk folder
Verify that you have the Monitoring.AlertRule / Read permission; without it, in-app notifications are also suppressed
Webhook delivery shows “Failed” in delivery history:
Open the webhook → Deliveries tab and check the HTTP status code and error message
Common causes:
4xx: the target endpoint rejected the request — verify the URL and any authentication the receiver requires
5xx: the receiver returned an error — check your endpoint’s own logs
Timeout: the endpoint did not respond within 10 seconds — ensure the receiver is reachable and handles requests quickly
Use “Test” on the webhook page to re-validate connectivity after fixing the issue
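The triage above can be written down as a small decision function. This is a sketch only: the 10-second limit comes from the text, while the function name, its inputs, and the returned messages are hypothetical and do not reflect how the Deliveries tab reports results.

```python
from typing import Optional

TIMEOUT_SECONDS = 10  # response limit stated in the text above


def diagnose_delivery(status_code: Optional[int], elapsed_seconds: float) -> str:
    """Map a delivery attempt to the most likely next troubleshooting step."""
    if status_code is None:
        # No HTTP response at all: either a timeout or a connection failure.
        if elapsed_seconds >= TIMEOUT_SECONDS:
            return "timeout: check that the receiver is reachable and responds quickly"
        return "no response: check DNS, TLS and network reachability of the endpoint"
    if 200 <= status_code < 300:
        return "delivered"
    if 400 <= status_code < 500:
        return "rejected: verify the URL and the receiver's authentication"
    if 500 <= status_code < 600:
        return "receiver error: check the endpoint's own logs"
    return "unexpected status: inspect the raw response"


print(diagnose_delivery(401, 0.3))
print(diagnose_delivery(None, 10.0))
```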
Webhook URL is rejected on save:
The URL must use https:// (plain HTTP is not allowed)
URLs that resolve to private IP ranges (e.g. 10.x, 192.168.x, 127.x) are blocked for security reasons; use a publicly accessible endpoint
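A URL can be pre-checked against both rules before it is saved. The sketch below uses only the Python standard library and is an approximation of the checks described above, not the platform's actual validator: it inspects the scheme and literal IP addresses only, and does not resolve hostnames, so a DNS name that points at a private address would still pass here.

```python
import ipaddress
from typing import List
from urllib.parse import urlparse


def precheck_webhook_url(url: str) -> List[str]:
    """Return a list of problems found; an empty list means the pre-check passed."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL must use https://")
    host = parsed.hostname
    if not host:
        problems.append("URL has no host")
        return problems
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name, not a literal IP; it would need resolving to check further.
        return problems
    if ip.is_private or ip.is_loopback:
        problems.append(f"{ip} is a private or loopback address")
    return problems


print(precheck_webhook_url("https://hooks.example.com/cmp"))
print(precheck_webhook_url("http://192.168.1.10:8443/hook"))
```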
Microsoft Teams webhook shows no message in the channel:
Confirm the connector is still active in Teams — Teams connectors can expire or be removed by a Teams admin
Verify the channel type is set to Microsoft Teams (not Webhook); a Webhook type sends plain JSON, not a MessageCard
Use the Test button to send a sample message and confirm it appears in Teams
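To illustrate why the channel type matters, the sketch below builds a minimal payload in the legacy MessageCard format that Teams connectors render. The exact fields Reply CMP sends are not documented on this page, so the helper and its field selection are assumptions that show only the general shape of such a card.

```python
import json


def message_card(title: str, text: str) -> dict:
    """Build a minimal MessageCard-style payload (illustrative field selection)."""
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": title,
        "title": title,
        "text": text,
    }


card = message_card("CPU alert", "Average CPU above threshold on vm-01")
print(json.dumps(card, indent=2))
```

A plain JSON body without the `@type` and `@context` markers is not a MessageCard, which is consistent with the symptom described above: the delivery can succeed while nothing appears in the channel.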
CMP Agent Errors#
“I don’t have access to that data” from the Agent:
Your user role does not include the required permission for that query domain. Example: a Discovery-only user cannot query cost data. Review your assigned roles in Tenant → Users.
Agent response seems outdated (references deleted resources, old figures):
The Agent operates on T-1 data. Resources deleted today will not be removed from the data until tomorrow. For cost figures, the most recent data is from the previous day.
Report generation fails or times out:
Retry after a few minutes — this may be a transient Azure OpenAI availability issue. If the problem persists, use FinOps → Reports for a scheduled report instead.
Agent returns a garbled table or formatting issue:
The chat interface renders plain text and Markdown tables. Very large datasets (hundreds of rows) may overflow the display. Ask the Agent to “limit results to top 10” for manageable output.