Open source · LLM agent over GraphSAGE

Ask your AWS bill questions. In English.

A natural-language agent that infers resource ownership from CloudTrail, IAM, and cost behaviour using a Graph Neural Network. Answers questions like "why did our bill spike Tuesday?" with specific resources, teams, and dollar amounts.

2.6M
Real Azure VMs in the audit set
13/15
Real-AWS attribution accuracy (87%)
+53%
Lift over best baseline (k-fold)
3
Clouds: AWS · Azure · GCP
01

Try it live

Chat with the agent over a synthetic 68-resource AWS account. The agent has 10 tools — it picks which to chain based on your question, hits the GraphSAGE-attributed scan, and answers in plain English.

~/costdna · costdna chat — live demo · synthetic AWS account · 68 resources · 4 teams


Live demo runs on GPT-4o · ~$0.01 per question · rate-limited to 5/IP/hour

02

The problem

Every FinOps team's most painful metric: the percentage of AWS spend that can't be attributed to any team. Industry estimates put it at 40–60%. Tagging is the standard answer, but tags drift. Resources are created in a hurry. Engineers leave.

Existing FinOps dashboards (CloudHealth, Vantage, Apptio) are only as good as the tags you have — and on most accounts, the tags you have aren't enough. CostDNA is the input layer: a tool that infers the missing tags from behaviour, then lets you ask English questions about the result.

03

The audit story

I had a 97% accuracy result on Microsoft's published 2.6M-VM Azure dataset. I audited it. It was a tautology.

First-cut
97%
Inflated by leak
Honest
6.9%
After audit, 100 classes
Random baseline
1%
12× lift remains
The catch

Across all 33,205 deployments in the Azure trace, every single one mapped to exactly one subscription. So deployment_id, used as a graph edge, was a perfect lookup of subscription_id. LabelProp's "97%" was a graph-database join, not learning.

Most engineers stop when they see a high accuracy number and ship it. I caught the leak by asking "are you sure the data is accurate?" The same audit on Microsoft's Philly DL trace surfaced another partial leak: 85% of users belong to one virtual cluster. Two public datasets, two different shortcuts, one consistent finding.

| Dataset | Resources | First-cut | Audited shortcut | Honest behavioural |
| --- | --- | --- | --- | --- |
| Microsoft Azure | 2.6M VMs | 97% | deployment_id ≡ sub (100% deterministic) | 6.9% (12× rand) |
| Microsoft Philly | 117K jobs | 89% | user_id → vc (85% deterministic) | 14% (2× rand) |
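
The audit itself is small enough to sketch. A minimal version in pandas, assuming a trace DataFrame whose edge and label columns are named as in the table (illustrative, not CostDNA's exact audit code):

python
# Leak check: does a candidate graph edge deterministically determine the label?
# If every deployment_id maps to exactly one subscription_id, a model that uses
# that edge is doing a join, not learning.
import pandas as pd

def leak_ratio(df: pd.DataFrame, feature: str, label: str) -> float:
    """Fraction of feature values that map to exactly one label value."""
    labels_per_value = df.groupby(feature)[label].nunique()
    return float((labels_per_value == 1).mean())

# Azure trace:  leak_ratio(trace, "deployment_id", "subscription_id") -> 1.00
# Philly trace: leak_ratio(trace, "user_id", "vc")                    -> ~0.85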

The methodological finding: production cloud attribution is mostly a metadata-lookup problem. Behavioural fingerprinting matters specifically when metadata is missing or unreliable — exactly the gap CostDNA's synthetic env reproduces.

04

How it works

Three layers, all cloud-agnostic. Only the collector layer (left) changes per cloud — the GNN and agent (right) are identical for AWS, Azure, and GCP.

STEP 1
Collect

Hardened boto3 / azure-mgmt / google-cloud collectors pull CloudTrail (or equivalent), Cost Explorer, IAM roles, VPC flow logs. Throttle-aware retry. AWS production-tested.
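
A hedged sketch of the collector's shape (the production collector adds checkpointing and further hardening; names here are illustrative):

python
# Throttle-aware CloudTrail pull. botocore's "adaptive" retry mode adds a
# client-side rate limiter on top of exponential backoff.
import boto3
from botocore.config import Config

RETRY_CFG = Config(retries={"mode": "adaptive", "max_attempts": 10})

def pull_cloudtrail_events(region: str = "us-east-1", max_pages: int = 50) -> list:
    ct = boto3.client("cloudtrail", region_name=region, config=RETRY_CFG)
    events = []
    for i, page in enumerate(ct.get_paginator("lookup_events").paginate()):
        events.extend(page["Events"])
        if i + 1 >= max_pages:  # cap the pull so one scan can't hammer the API
            break
    return events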

STEP 2
Train

Behavioural features (peak_hour, write_ratio, event_diversity, …) + LLM-derived semantic features (sentence-transformer over IAM names) → 2/4-layer GraphSAGE GNN. Auto-shrinks for small label sets.
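
In code, the model is roughly this shape. A minimal PyTorch Geometric sketch, not CostDNA's exact classifier; dimensions and names are placeholders:

python
# Two SAGEConv layers with a residual connection. Input features concatenate
# behavioural stats with sentence-transformer embeddings of IAM names.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class ResourceSAGE(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int, n_teams: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_teams)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = h + F.relu(self.conv2(h, edge_index))  # residual hop
        return self.head(h)  # per-resource team logits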

STEP 3
Ask

10-tool LLM agent (GPT-4o, function-calling) answers natural-language questions. Tools are pure data lookups against the trained scan output — fast, deterministic, auditable.
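
The loop itself is the standard OpenAI function-calling pattern. A minimal sketch; the tool schemas are elided and run_tool is a hypothetical dispatcher over the 10 lookups listed below:

python
# One turn of the agent: the model picks tools, we execute them as pure data
# lookups, feed results back, and the model writes the plain-English answer.
import json
from openai import OpenAI

client = OpenAI()

def answer(question: str, tools: list, run_tool) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no more lookups requested: final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:  # run each requested lookup
            result = run_tool(call.function.name,
                              json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})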

05

10 tools the agent chains

Each tool wraps a piece of the underlying CostDNA pipeline. The LLM decides which to call (or chain) based on the visitor's question.

summarize_account

High-level rollup: total resources, by-team spend and confidence.

attribute_resource

Look up which team owns a specific resource and the why-explanation.

top_spenders

Top resources by total cost, optionally filtered by team.

find_cost_spikes

Largest spikes + Granger-causality attribution to deploys (sketched in code after this list).

find_anomalies

Resources that don't fit any team — investigate manually.

search_resources

Substring match across resource IDs.

signal_history

Recent CloudTrail events + cost samples for one resource.

find_idle

Low-activity resources to consider for cleanup.

compare_teams

Side-by-side comparison: counts, spend, top resources, by type.

find_abandoned

Resources whose activity has collapsed in the recent half of the window — likely abandoned. Sorted by spend.
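
The Granger step behind find_cost_spikes (referenced above) can be sketched with statsmodels, assuming two aligned daily series; the DataFrame and column names are hypothetical:

python
# Does the deploy count help predict cost beyond cost's own history?
# A small p-value at some lag suggests deploys Granger-cause the spike.
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def deploy_causes_cost(df: pd.DataFrame, max_lag: int = 3) -> dict:
    # statsmodels tests whether the second column Granger-causes the first
    results = grangercausalitytests(df[["cost", "deploys"]], maxlag=max_lag)
    return {lag: res[0]["ssr_ftest"][1] for lag, res in results.items()}  # lag -> p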

06

Real-AWS deployment

Provisioned a labelled AWS environment, ran per-team workload simulators on an always-on EC2 instance to generate authentic CloudTrail signal, then scanned the live account. Same code that powers the live demo.

13 / 15
Per-resource accuracy (87%)
13 / 13
High-confidence (≥0.79) accuracy — 100%
13,402
CloudTrail events processed

Both wrong predictions came back with confidence below 0.7 and were correctly surfaced by find_anomalies for human review — exactly the active-learning workflow the system is designed for. Verification artifacts in docs/real-aws-evidence/.

07

Visual proof — embedding space

GraphSAGE learns an embedding in which same-team resources cluster together and unowned ones sit visibly separate; UMAP projects it to 2D for the plot below.

UMAP embedding of synthetic AWS resources

Synthetic AWS env. The tan unowned cluster (vendor / legacy / orphan / shadow resources) sits visibly apart from the team clusters. The anomaly detector catches them automatically.
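
The projection itself is a couple of lines, assuming the trained node embeddings as an (n_resources × hidden_dim) array; umap-learn is the illustrative choice here:

python
# Project GraphSAGE's hidden-layer embeddings to 2D for plotting.
import umap  # pip install umap-learn

def project_2d(embeddings):
    return umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)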

08

Multi-cloud architecture

The model + features + agent are cloud-agnostic — only the collector layer is provider-specific. AWS calls cloudtrail:LookupEvents; Azure calls monitor.activity_logs.list; GCP calls cloud_logging.list_entries. Same downstream pipeline.
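
One way to express that boundary, as a hedged sketch of the provider interface (names illustrative, not CostDNA's actual classes):

python
# Each cloud implements the same small surface; everything downstream
# (features, GNN, agent tools) consumes the normalized records.
from typing import Protocol

class Collector(Protocol):
    def collect_events(self) -> list[dict]: ...      # CloudTrail / Activity Log / Cloud Logging
    def collect_costs(self) -> list[dict]: ...       # Cost Explorer / Cost Management / Billing
    def collect_identities(self) -> list[dict]: ...  # IAM roles and policies per provider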

AWS
Production-tested

13/15 = 87% on real AWS · live demo runs against this stack

Azure
Implemented · awaiting validation

Code follows azure-mgmt-resource v25 + activity-logs + cost-management v4 patterns; mocked-shape tests pass

GCP
Implemented · awaiting validation

Code follows google-cloud-asset v4 + cloud-logging protobuf payloads; mocked-shape tests pass

09

Run it yourself

Four usage patterns: CLI, REPL, web UI, and Docker. The first three run against the same agent code that powers the live demo above.

A. One-shot CLI

bash
$ costdna ask "why did our bill spike Tuesday?" \
    --from-dir runs/today

B. Multi-turn REPL

bash
$ costdna chat --from-dir runs/today
[0] ❯ summarize this account
[1] ❯ which 5 resources are spending the most?
[2] ❯ tell me about i-0c4f3230 specifically

C. Web chat UI (Streamlit)

bash
$ pip install 'costdna[ui,agent]'
$ costdna serve
# open http://localhost:8501 → "Chat with the agent" tab

D. Docker (no install)

bash
$ docker run --rm pauti04/costdna scan --synthetic

Setup for the agent commands: pip install 'costdna[agent]' + export OPENAI_API_KEY=...

10

Stack

  • Python 3.11: pandas, numpy, scikit-learn, statsmodels, networkx
  • PyTorch 2.x + PyTorch Geometric: GraphSAGE classifier — 2 to 4 layers, residual
  • sentence-transformers: MiniLM embeddings of IAM role names + resource IDs + tags
  • OpenAI SDK: function-calling agent loop — pluggable LLM backend
  • boto3 (hardened): adaptive retry, throttle-aware CloudTrail lookup_events
  • azure-mgmt-* + google-cloud-*: multi-cloud collectors (lazy-loaded extras)
  • Streamlit + Click + Rich: CLI commands + interactive chat UI
  • Terraform: labelled AWS env with CloudTrail data events + VPC Flow Logs
  • pytest + GitHub Actions: CI on every commit; Docker auto-publish on tag
  • Next.js + Vercel: this landing page + serverless agent endpoint

Open source. Forkable. Ready.

MIT licensed, hardened collectors, multi-cloud architecture, real-AWS numbers in the README. If you're hiring for cloud-cost / FinOps / ML-infra roles — I'd like to do this kind of work full-time.