Troubleshooting

Actionable troubleshooting reference for every module. Jump to the section that matches your problem.


Cloud Connection Errors

Azure Connection Errors

| Error code | Cause | Resolution |
| --- | --- | --- |
| AADSTS7000222 | The client secret has expired | Rotate the Service Principal secret in Azure AD, then update credentials in Tenant → Connections → Update Secret |
| AADSTS70011 | Invalid scope or app not registered | Verify the Application (Client) ID is correct; ensure the Service Principal is registered in the correct tenant |
| invalid_client | The Client ID or Client Secret is incorrect | Re-check credentials in Tenant → Connections → Update Secret |
| AADSTS65001 | User has not consented to the required API permissions | Grant admin consent to the Reply CMP app registration in Azure AD |
| AuthorizationFailed | The service principal lacks the required role on the subscription | Assign the Reader role (minimum) on the subscription to the service principal in Azure IAM |

AWS Connection Errors

| Error code | Cause | Resolution |
| --- | --- | --- |
| InvalidAccessKeyId | The Access Key ID does not exist or has been deleted | Generate a new access key in AWS IAM and update via Tenant → Connections |
| SignatureDoesNotMatch | The Secret Access Key is incorrect | Verify the Secret Access Key and update via Tenant → Connections → Update Secret |
| UnrecognizedClientException | The AWS account is suspended or the key is invalid | Verify the AWS account status and key validity in the AWS console |
| AccessDenied | Insufficient IAM permissions | Ensure the IAM user/role has the required read policies; see Connect a Provider for the IAM policy template |

GCP Connection Errors

| Error code | Cause | Resolution |
| --- | --- | --- |
| TokenResponseException | The service account key is invalid, expired, or revoked | Generate a new service account key in GCP IAM and update via Tenant → Connections |
| 403 Forbidden | The service account lacks sufficient project permissions | Assign the Viewer role (minimum) to the service account on the GCP project |
| 404 Not Found | The GCP project ID in the connection configuration is incorrect | Verify the project ID in Tenant → Connections → Open Details |
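
When triaging from logs or scripts, it can help to pull the error code out of a raw provider message before consulting the tables above. The following is a minimal sketch only: `KNOWN_CODES` and `find_error_code` are hypothetical names, not part of the product, and the codes are simply those listed in this section.

```python
import re

# Codes copied from the connection error tables in this section.
# KNOWN_CODES and find_error_code are illustrative names, not a product API.
# The generic HTTP statuses map to "gcp" only because this page lists them there.
KNOWN_CODES = {
    "AADSTS7000222": "azure",
    "AADSTS70011": "azure",
    "AADSTS65001": "azure",
    "invalid_client": "azure",
    "AuthorizationFailed": "azure",
    "InvalidAccessKeyId": "aws",
    "SignatureDoesNotMatch": "aws",
    "UnrecognizedClientException": "aws",
    "AccessDenied": "aws",
    "TokenResponseException": "gcp",
    "403 Forbidden": "gcp",
    "404 Not Found": "gcp",
}


def find_error_code(message):
    """Return (code, provider) for the first known code found, else None."""
    for code, provider in KNOWN_CODES.items():
        # Word boundaries keep a code from matching inside a longer token.
        if re.search(r"\b" + re.escape(code) + r"\b", message):
            return code, provider
    return None
```

The returned code can then be looked up in the matching provider table for its resolution.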


Discovery Errors

Resources not appearing:

  1. Check that the connection completed a successful sync: Tenant → Connections → Last sync status

  2. Verify the resource type is in the supported coverage list: Supported Services

  3. Check the cloud credentials have read permission for the resource type

  4. Trigger a manual sync: Tenant → Connections → Launch Discovery


Discovery sync stuck or running for > 30 minutes:

This may indicate a credentials issue or provider API throttling. Check the Audit Logs (Tenant → Auditing → Discovery tab) for error events. If no error is shown, contact your platform administrator.


“Not provisioned via CMP” label:

This is informational — it means the resource exists but was not created via the Reply CMP provisioning wizard. It is not an error.


FinOps Errors

€0 / blank cost for a resource you know has spend:

  • Verify the connection has completed at least one cost ingestion (different from discovery sync): Tenant → Connections → Last cost refresh

  • Cost data is T-1 (previous day). Resources provisioned today will show €0 until tomorrow.

  • Some commitment types (Reserved Instances, Savings Plans) result in €0 marginal cost on covered resources — the cost appears on the reservation commitment line instead.
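
The T-1 rule above can be expressed as simple date arithmetic. This is a minimal sketch, assuming cost data always covers up to and including the previous calendar day; the function names are hypothetical and not part of the product.

```python
from datetime import date, timedelta


def latest_cost_date(today):
    """Most recent day covered by T-1 cost data (assumed: yesterday)."""
    return today - timedelta(days=1)


def expect_cost_data(provisioned_on, today):
    """True once the provisioning day itself falls inside the T-1 window."""
    return provisioned_on <= latest_cost_date(today)
```

For a resource provisioned today, `expect_cost_data` returns `False` today and `True` from tomorrow onward, which matches the €0 behaviour described above.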


“Missing data” or gaps in cost charts:

  • Verify the billing period was active for that connection.

  • Some providers have billing data gaps for very new resources (< 24h old).

  • Cost export APIs occasionally have delays — they typically resolve within 24–48h.


Cannot create a budget:

  • Requires FinOps.Budget / Write permission — check your role.

  • Each Group node can have only one budget — check if one already exists.


Forecast toggle is greyed out:

The forecast toggle requires Actual cost type and does not work with Amortised. Switch Cost Type to Actual in the Analyze or Assess view.


Provisioning Errors

Deployment failed — “insufficient permissions”:

The cloud connection used for the deployment does not have enough permissions (typically Contributor is needed for resource creation). Verify the service principal’s role on the subscription/project in the cloud provider console.


Resource name already in use:

Some resource types require globally unique names (e.g., Azure Storage Accounts, AWS S3 buckets). Try a different name with a unique suffix.
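
One way to produce a unique suffix is sketched below for Azure Storage Accounts, whose names must be 3 to 24 characters long and contain only lowercase letters and digits. The helper name `unique_storage_name` and the 8-character random suffix are assumptions chosen for illustration; a random suffix makes a collision unlikely but does not guarantee the name is free.

```python
import re
import uuid


def unique_storage_name(prefix, suffix_len=8):
    """Build a candidate Azure Storage Account name from a readable prefix."""
    # Keep only lowercase letters and digits, as the naming rules require.
    base = re.sub(r"[^a-z0-9]", "", prefix.lower())
    # Random hex suffix; uuid4().hex is already lowercase alphanumeric.
    suffix = uuid.uuid4().hex[:suffix_len]
    # Trim the prefix so the full name stays within the 24-character limit.
    return base[: 24 - suffix_len] + suffix
```

For example, `unique_storage_name("My-App_Logs")` yields a name that starts with `myapplogs` followed by eight random hexadecimal characters.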


Deployment stuck in “Deploying” state for > 15 minutes:

This may indicate a provider-side timeout. Check the raw Terraform output for the last error. If the Terraform process was interrupted, the deployment may need manual cleanup; in that case, contact your platform administrator.


“AI Plan Summary not available”:

Occurs when Azure OpenAI is unavailable or rate-limited. The raw Terraform output is always available as a fallback. Retry the dry-run after a few minutes.


Automation Errors

Policy did not fire at scheduled time:

  • Verify the connection used has active credentials: in Tenant → Connections, check for a red expiry chip.

  • Check the Execution History tab on the policy for error messages from the last run.

  • Remember schedules are in UTC — verify the cron expression produces the expected UTC time.
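
To check the UTC point above, a local wall-clock time can be converted before writing the cron expression. This is a minimal sketch, assuming a standard five-field cron format and a daily schedule (neither is confirmed by this page); `daily_cron_in_utc` is a hypothetical helper. The UTC offset is passed explicitly, so it has to be revisited when daylight saving time changes.

```python
from datetime import datetime, timedelta, timezone


def daily_cron_in_utc(local_hour, local_minute, utc_offset_hours):
    """Return a daily five-field cron expression in UTC for a local time."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    # The date is arbitrary; only the time of day matters for a daily schedule.
    local = datetime(2024, 1, 1, local_hour, local_minute, tzinfo=local_tz)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"
```

For example, 08:00 at UTC+1 becomes `0 7 * * *`, while the same local time at UTC+2 becomes `0 6 * * *`. Schedules restricted to specific weekdays need extra care, because the conversion can cross midnight.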


Resources showing “Failed” in execution history:

  • The cloud connection may lack the required write permissions (e.g., Microsoft.Compute/virtualMachines/start/action for Azure).

  • The resource may be in a state that prevents the action (e.g., a VM being resized cannot be started or stopped).
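
Where a broad built-in role is not desirable, an Azure custom role limited to VM power actions is one option. The sketch below builds such a role definition as a Python dictionary. Only `start/action` comes from the text above; the `read` and `deallocate/action` entries, the role name, and the `vm_power_role` helper are assumptions for illustration, so confirm which actions your policies actually need.

```python
import json


def vm_power_role(subscription_id):
    """Build an Azure custom role definition limited to VM power actions."""
    return {
        "Name": "VM Power Operator (example)",  # hypothetical role name
        "IsCustom": True,
        "Description": "Start and deallocate virtual machines for scheduled policies.",
        "Actions": [
            "Microsoft.Compute/virtualMachines/read",
            "Microsoft.Compute/virtualMachines/start/action",
            # Assumed to be what a stop action needs; not stated in this page.
            "Microsoft.Compute/virtualMachines/deallocate/action",
        ],
        "NotActions": [],
        "AssignableScopes": [f"/subscriptions/{subscription_id}"],
    }


if __name__ == "__main__":
    # Placeholder subscription ID; replace with a real one before use.
    print(json.dumps(vm_power_role("00000000-0000-0000-0000-000000000000"), indent=2))
```

The printed JSON can be saved to a file and reviewed before a role is created from it in Azure.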


GKE Autopilot Stop policy not working:

GKE Autopilot clusters are not supported for Stop policies. They manage their own node lifecycle. Use Standard GKE clusters if automation scheduling is required.


Monitoring Errors

Widget shows “Query error”:

  1. Verify the connection is active: in Tenant → Connections, check for a red expiry chip

  2. Use the “Test query” button in the widget editor to surface the raw API error

  3. Check provider-side permissions:

    • Azure: Reader role on the subscription covers Metrics and Alerts Management. For Log Analytics (KQL), if the workspace uses workspace-centric access mode, additionally assign Log Analytics Reader on that workspace.

    • AWS: ReadOnlyAccess already includes all permissions for CloudWatch Metrics, Alarms, and Logs Insights. No extra policies needed.

    • GCP: roles/viewer includes all roles/monitoring.viewer permissions. No extra roles needed.


Widget shows no data / flat chart:

  • Widen the time range — some metrics only emit every 5–15 minutes

  • For Azure Metrics: metric name and resource type must match exactly (case-sensitive)

  • For Log Analytics: verify the workspace contains data for the selected time range


Dashboard not loading after import:

  • Confirm the .welkin-dashboard file was exported from v1.1.0 or later (format version "1")

  • All widget IDs are regenerated on import; connection references are preserved by connection ID — ensure the same connections exist in the target tenant


Alert rule stays in “Error” state:

  1. Open the rule → History tab — the error message shows the raw failure reason

  2. Common causes: expired credentials, query references a deleted resource or workspace, KQL syntax error

  3. Use “Evaluate now” after fixing configuration to confirm the error clears


Alert rule fires immediately after creation:

Expected — if the threshold condition is already met at first evaluation, the rule fires on first run by design.


Alert rule shows “No Data”:

  • Widen the Evaluation window — some metrics are sparse (5–10 minute intervals)

  • For the AnyResult operator with Azure Alerts: No Data means no active alerts were returned, not a query failure
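
The effect of a sparse metric on a narrow evaluation window can be illustrated with a small calculation. This is a simplified model, assuming the metric emits one sample at every exact multiple of its interval; `samples_in_window` is a hypothetical helper, not how the product evaluates rules.

```python
def samples_in_window(emit_interval_min, window_min, now_min):
    """Count samples in the half-open window (now - window, now].

    Assumes one sample at every multiple of emit_interval_min (in minutes).
    """
    start = now_min - window_min
    # Multiples of the interval up to now, minus those at or before the start.
    return now_min // emit_interval_min - start // emit_interval_min
```

With a metric emitted every 10 minutes, a 5-minute window evaluated at minute 18 contains no samples, while a 15-minute window contains one. This is why widening the window can clear a No Data state.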


Alert email notifications not received:

  1. Confirm Email notification is enabled on the rule and the recipient address is correct

  2. Check your spam/junk folder

  3. Verify Monitoring.AlertRule / Read permission — without it, in-app notifications are also suppressed


Webhook delivery shows “Failed” in delivery history:

  1. Open the webhook → Deliveries tab and check the HTTP status code and error message

  2. Common causes:

    • 4xx: the target endpoint rejected the request — verify the URL and any authentication the receiver requires

    • 5xx: the receiver returned an error — check your endpoint’s own logs

    • Timeout: the endpoint did not respond within 10 seconds — ensure the receiver is reachable and handles requests quickly

  3. Use “Test” on the webhook page to re-validate connectivity after fixing the issue
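
The common causes above amount to a small decision table. A minimal sketch follows, using a hypothetical `diagnose_delivery` helper; the messages paraphrase this section, and the inputs are assumed to come from the Deliveries tab.

```python
def diagnose_delivery(status_code=None, timed_out=False):
    """Map a webhook delivery outcome to the next troubleshooting step."""
    if timed_out or status_code is None:
        return "timeout: ensure the receiver is reachable and responds within 10 seconds"
    if 400 <= status_code <= 499:
        return "4xx: verify the URL and any authentication the receiver requires"
    if 500 <= status_code <= 599:
        return "5xx: check the receiving endpoint's own logs"
    if 200 <= status_code <= 299:
        return "delivered"
    return "unexpected status: inspect the error message in the delivery record"
```

For example, `diagnose_delivery(401)` points to the receiver's authentication, while `diagnose_delivery(timed_out=True)` points to reachability and response time.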


Webhook URL is rejected on save:

  • The URL must use https:// — plain HTTP is not allowed

  • URLs that resolve to private IP ranges (e.g. 10.x, 192.168.x, 127.x) are blocked for security reasons — use a publicly accessible endpoint
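
A URL can be pre-checked against both rules before saving. The sketch below approximates them with the Python standard library; it is not the product's actual validation logic, and `check_webhook_url` is a hypothetical name. It treats any address that Python's `ipaddress` module classifies as private or loopback as blocked, which may be broader or narrower than the product's own list.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def check_webhook_url(url):
    """Return 'ok' or a short reason the URL would likely be rejected."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return "rejected: URL must use https://"
    host = parsed.hostname
    if not host:
        return "rejected: URL has no hostname"
    try:
        # Resolves DNS names; IP literals are returned without a lookup.
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return "rejected: hostname does not resolve"
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback:
            return f"rejected: {addr} is a private or loopback address"
    return "ok"
```

For example, `check_webhook_url("https://192.168.1.20/hook")` is rejected as a private address, and any `http://` URL is rejected before resolution is attempted.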


Microsoft Teams webhook shows no message in the channel:

  1. Confirm the connector is still active in Teams — Teams connectors can expire or be removed by a Teams admin

  2. Verify the channel type is set to Microsoft Teams (not Webhook) — a Webhook type sends plain JSON, not a MessageCard

  3. Use the Test button to send a sample message and confirm it appears in Teams
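
To show how a MessageCard differs from plain JSON, the sketch below builds a minimal card in the legacy MessageCard format used by Teams connectors. The exact fields the product sends are not documented here, so the field selection, the default colour, and the `build_message_card` helper are assumptions.

```python
import json


def build_message_card(title, text, theme_color="D13438"):
    """Build a minimal legacy MessageCard payload for a Teams connector."""
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": title,  # short text used in notifications
        "themeColor": theme_color,  # hex colour without a leading '#'
        "title": title,
        "text": text,
    }


if __name__ == "__main__":
    card = build_message_card("CPU alert", "CPU above 90% for 10 minutes")
    print(json.dumps(card, indent=2))
```

A receiver that expects this shape will not render an arbitrary JSON body, which is consistent with the channel type check in step 2.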


CMP Agent Errors

“I don’t have access to that data” from the Agent:

Your user role does not include the required permission for that query domain. Example: a Discovery-only user cannot query cost data. Review your assigned roles in Tenant → Users.


Agent response seems outdated (references deleted resources, old figures):

The Agent operates on T-1 data. Resources deleted today will not be removed from the data until tomorrow. For cost figures, the most recent data is from the previous day.


Report generation fails or times out:

Retry after a few minutes — this may be a transient Azure OpenAI availability issue. If the problem persists, use FinOps → Reports for a scheduled report instead.


Agent returns a garbled table or formatting issue:

The chat interface renders plain text and Markdown tables. Very large datasets (hundreds of rows) may overflow the display. Ask the Agent to “limit results to top 10” for manageable output.