Practitioner Architecture Guide · Data Center Security · April 2026

Building Your Own AI-Augmented Physical Security Monitoring System

Paul Jankowski | CoreBastion Security Consulting | April 2026

A practitioner's architecture guide for data center and enterprise security teams — covering how to build a functionally equivalent system to commercial AI monitoring platforms using open-source VMS, commercial CV APIs, and LLM reasoning layers. Written for security practitioners who want to make informed build-vs.-buy decisions.

Proof of Concept Budget

A development environment capable of demonstrating the full four-layer architecture on 4–8 cameras:

$800–$1,500 Hardware (one-time)

Cloud CV API costs at PoC scale: under $50/month. Claude API for event reasoning calls: under $20/month.

Strategic Framing

Why Build Instead of Buy?

Commercial AI-augmented physical security platforms like Ambient.ai, Verkada, and HiveWatch have demonstrated what is possible when computer vision and behavioral analytics are applied to existing camera infrastructure. The results are real. What those platforms do not offer is ownership.

You license their inference engine, their data pipeline, and their operator interface. Your footage runs through their stack. For data center operators and enterprise security leaders who manage sensitive infrastructure, that is a meaningful constraint.

Case for Buying a Commercial Platform

Faster time to value. Vendor support. Pre-integrated stack. Ambient.ai, Verkada, and similar platforms have invested heavily in their models and operator UX. For many organizations, buying or licensing is the right call. The commercial platform delivers working AI-augmented monitoring faster than any internal build.

Case for Building

Data sovereignty. Your camera feeds contain operationally sensitive information about your facility's interior layout, access patterns, and staffing rhythms. In a hyperscale data center context, that data does not belong in a third-party cloud.

Control over the reasoning layer. A commercial platform's alerting logic is a black box. A custom build lets you define exactly what constitutes an anomaly, tune thresholds to your environment, and integrate operational context that a vendor cannot know.

Cost at scale. Per-camera licensing fees compound quickly across large deployments. An owned architecture has largely fixed infrastructure cost and a low marginal cost per additional camera or sensor feed, driven mainly by storage and inference capacity rather than licensing.

The Honest Tradeoff

Building requires more upfront engineering effort and ongoing maintenance ownership. This guide will help you assess whether that tradeoff is worth it for your environment. The barrier is not technical complexity — it is knowing what the layers are, what each one does, and how they connect.

Who This Guide Is For

This is not a beginner's guide. It is written for security practitioners who understand physical security operations and want to make informed decisions about whether to build, buy, or partner on AI integration. Some engineering familiarity is assumed — but deep software development experience is not required to evaluate and direct this architecture.

Architecture Overview

Four Functional Layers

An Ambient-like system has four functional layers. Each can be sourced independently — which is the key architectural advantage of a build approach. These layers communicate through structured event data.

Layer	Name	Function	Build Option	Difficulty
1	Video Management System (VMS)	Ingest, store, and stream camera feeds	Frigate NVR / Milestone XProtect	Low
2	Computer Vision (CV)	Object detection, behavioral analysis	AWS Rekognition / YOLO + OpenCV	Medium
3	AI Reasoning	Event correlation, alert generation, operator guidance	Claude API / ChatGPT API	Low–Medium
4	Operator Interface	Real-time alert display, query interface, logging	Custom React web application	Medium

The CV layer detects and classifies. The reasoning layer interprets and decides. The operator interface presents and records. The VMS ties the camera hardware to the pipeline. Each layer can be upgraded independently — you are not locked into a monolithic stack.

Layer 1

Video Management System

A VMS that can ingest RTSP streams from IP cameras, store recorded footage, and expose an API or event stream that downstream components can consume. You are not rebuilding the VMS — you are selecting one that does not lock you into a closed ecosystem.

Platform	Cost	API Access	Best For
Frigate NVR	Free / open source	Full REST API + MQTT events	Development, lab, smaller deployments
Milestone XProtect Essential+	Free tier available	MIP SDK	Enterprise environments, existing deployments
Genetec Security Center	Licensed	SDK + REST API	Unified access + video environments
Nx Witness / Network Optix	Licensed	Open REST API	Mid-market, strong integration story

For a development environment and proof of concept, Frigate is the correct starting point. It runs on Linux via Docker, integrates with commercial IP cameras via RTSP, has built-in motion zone configuration, and publishes detection events to an MQTT broker in real time. It also natively supports hardware-accelerated inference via Google Coral TPU or NVIDIA GPU.

Frigate NVR — Docker Quick Start
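# Minimal deployment on a Linux host with an NVIDIA GPU.
# --gpus all requires the NVIDIA Container Toolkit on the host.
# GPU-accelerated detection may need a GPU-specific image tag;
# check the Frigate documentation for your version.
# Size --shm-size to your camera count and resolution.
docker run -d \
  --name frigate \
  --restart unless-stopped \
  --shm-size=256m \
  --gpus all \
  -v /path/to/config:/config \
  -v /path/to/storage:/media/frigate \
  -p 5000:5000 \
  ghcr.io/blakeblackshear/frigate:stable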

# config.yml defines cameras, detection zones,
# object classes, and MQTT broker settings.
# Every detection fires structured JSON to MQTT broker.
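
As a sketch of how a downstream component might consume that event stream: Frigate publishes event messages on the `frigate/events` topic, each carrying a `type` field and `before`/`after` snapshots of the tracked object. The helper below flattens one message into a small dict. The output field names, the broker hostname, and the `run` wrapper are illustrative assumptions, and the payload fields should be checked against the Frigate version you deploy.

```python
import json
from typing import Optional, Union


def normalize_frigate_event(raw: Union[bytes, str]) -> Optional[dict]:
    """Flatten one frigate/events message; return None for unusable payloads."""
    try:
        msg = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(msg, dict):
        return None
    after = msg.get("after") or {}
    if not after.get("label"):
        return None
    return {
        "event_id": after.get("id"),
        "type": msg.get("type"),            # "new", "update", or "end"
        "camera": after.get("camera"),
        "label": after.get("label"),
        "score": after.get("top_score"),
        "zones": after.get("current_zones", []),
        "start_time": after.get("start_time"),
    }


def run(broker_host: str = "localhost") -> None:
    """Subscribe and print new events (requires the paho-mqtt package)."""
    import paho.mqtt.subscribe as subscribe  # imported lazily; optional dependency

    def on_message(client, userdata, message):
        event = normalize_frigate_event(message.payload)
        if event and event["type"] == "new":
            print(event)

    # Blocks forever, invoking on_message for every message on the topic
    subscribe.callback(on_message, "frigate/events", hostname=broker_host)
```

Keeping the parsing in a pure function means it can be unit tested without a broker, and the reasoning layer only ever sees the flattened shape.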

Practitioner Note — Frigate

Frigate is not enterprise-hardened out of the box. For a production deployment serving a data center, you will need to add authentication, network segmentation, and a proper storage architecture. For a proof of concept, it is fully capable at small camera counts and will get you to a working demo in one to two days.

Frigate Already Includes a CV Layer

Frigate includes a lightweight CV layer of its own. Its built-in detector ships with a default object detection model, and detection can also be offloaded to an external DeepStack-compatible server; YOLO-family models can be configured on some detector backends. For many proof-of-concept deployments, Frigate's built-in detection is sufficient and eliminates the need to build a separate CV layer entirely. Start there, then layer in cloud CV APIs or custom YOLO models when you hit the limits of what Frigate's defaults can detect.

Layer 2

Computer Vision

The CV layer transforms raw video frames into structured data. It answers: what is physically present in this frame, and where? Object detection identifies people, vehicles, and items of interest. Behavioral analysis identifies conditions: loitering, intrusion into exclusion zones, tailgating, objects left behind.
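
To make the step from object detection to behavioral conditions concrete, here is a minimal sketch of two of the conditions named above: intrusion into an exclusion zone and loitering. It assumes the detector or tracker supplies a stable track ID and a pixel-space bounding box per person; the class name, the 60-second threshold, and the return labels are all illustrative, not part of any library.

```python
from dataclasses import dataclass, field


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon vertex list."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


@dataclass
class ZoneMonitor:
    polygon: list                     # zone outline in pixel coordinates
    loiter_seconds: float = 60.0      # assumed threshold; tune per zone
    first_seen: dict = field(default_factory=dict)  # track_id -> entry time

    def update(self, track_id, box, timestamp):
        """Return 'outside', 'intrusion', or 'loitering' for one detection."""
        x1, y1, x2, y2 = box
        foot_x, foot_y = (x1 + x2) / 2, y2  # bottom centre approximates the feet
        if not point_in_polygon(foot_x, foot_y, self.polygon):
            self.first_seen.pop(track_id, None)  # reset dwell time on exit
            return "outside"
        entered = self.first_seen.setdefault(track_id, timestamp)
        if timestamp - entered >= self.loiter_seconds:
            return "loitering"
        return "intrusion"
```

Tailgating and left-object detection need more state (door events, object persistence) and are not covered by this sketch.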

Important: Claude and ChatGPT are large language models, not video analysis engines. Both accept still images, but neither ingests a live video stream, and sending every frame from every camera to an LLM is impractical on cost and latency grounds. The CV layer must therefore exist as a separate component. It is the most technically demanding part of the build.

Option A: Cloud CV APIs — Recommended for Most Teams

API	Strengths	Relevant Capabilities
AWS Rekognition	Deep AWS integration, strong person detection	Person detection, face search, activity labels, PPE detection
Google Video Intelligence	Strong temporal analysis	Object tracking, scene change, explicit content detection
Azure Video Indexer	Microsoft ecosystem fit	People, objects, motion, scene segmentation
Roboflow	Custom model training, fast deployment	Train models on your specific camera views and threat types

For a data center context, AWS Rekognition is the natural fit if you are already operating in AWS infrastructure. You can pipe frames from Frigate to a Lambda function that calls Rekognition, then publish structured results to your reasoning layer.
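
A minimal sketch of the Lambda side of that pipeline follows. It uses Rekognition's `DetectLabels` operation through `boto3` and keeps only `Person` instances. The event shape (`frame_b64`), the output field names, and the 80 percent confidence floor are assumptions for illustration; how frames reach the function from Frigate is left to your integration.

```python
def detect_people(jpeg_bytes, rekognition, min_confidence=80.0):
    """Return person detections from one JPEG frame via Rekognition DetectLabels."""
    response = rekognition.detect_labels(
        Image={"Bytes": jpeg_bytes},
        MaxLabels=20,
        MinConfidence=min_confidence,
    )
    people = []
    for label in response.get("Labels", []):
        if label["Name"] != "Person":
            continue
        for instance in label.get("Instances", []):
            people.append({
                "label": "person",
                "confidence": instance["Confidence"],
                # BoundingBox values are ratios of the frame width and height
                "box": instance["BoundingBox"],
            })
    return people


def lambda_handler(event, context):
    """Hypothetical entry point; expects a base64 JPEG under event['frame_b64']."""
    import base64
    import boto3  # provided by the AWS Lambda Python runtime

    client = boto3.client("rekognition")
    frame = base64.b64decode(event["frame_b64"])
    return {"detections": detect_people(frame, client)}
```

Passing the client in as a parameter keeps `detect_people` testable with a stub, which matters once you start measuring accuracy against your own camera views.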

Option B: Local Inference with YOLO

YOLO (You Only Look Once) is the industry-standard open-source object detection model. It runs locally on your hardware, has no per-API-call cost, and can be fine-tuned on your specific camera views and threat scenarios. The tradeoff is setup complexity and GPU requirements.

Python — Basic YOLO Detection Pipeline
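import json

import cv2
import paho.mqtt.publish as publish
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # nano model, fast inference
cap = cv2.VideoCapture('rtsp://camera-ip/stream')

def format_detections(result):
    # One dict per box: class label, confidence, pixel-space corners
    return [{'label': result.names[int(b.cls)], 'confidence': float(b.conf),
             'box': b.xyxy[0].tolist()} for b in result.boxes]

while True:
    ret, frame = cap.read()
    if not ret:  # stream dropped or ended; reconnect in production
        break
    result = model(frame, verbose=False)[0]  # one frame in, one result out
    detections = format_detections(result)
    if detections:
        # Topic and broker host are placeholders for your environment
        publish.single('cv/detections', json.dumps(detections), hostname='localhost')
cap.release()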

An NVIDIA RTX 3060 or better handles real-time YOLO inference on 4–8 simultaneous camera streams. For larger deployments, distribute inference across multiple GPU nodes.

Layer 3 — The Most Strategically Important Layer

The AI Reasoning Layer

This is the most strategically important layer and the one most often misunderstood. Large language models do not watch video. What they do exceptionally well is reason over structured event data, correlate events across time and source, apply context-sensitive rules, and generate human-readable output.

In a physical security context, that translates to:

Event Assessment

Receiving a structured event payload from the CV layer and deciding whether it constitutes an actionable security condition — with threat level, recommended action, and operator alert text.

Multi-Event Correlation

Correlating multiple events across time — motion in a hallway followed by a door held open — to identify compound threats that single-event systems miss entirely.

Facility-Specific Context

Applying context that a generic commercial platform cannot know: which zones are restricted at which times, what shift patterns are normal, what constitutes anomalous behavior in your specific environment.

Natural Language Operator Interface

Generating operator alerts that include situational context, recommended response, and escalation criteria — and answering operator queries in plain language during an incident.

Python — Claude API Integration for Event Reasoning
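import json
import os
from datetime import datetime

import anthropic

# Read the API key from the environment instead of hard-coding it
client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])

# FACILITY_CONTEXT (a string you author) and recent_events() (the rolling
# buffer helper) are defined elsewhere in your application.

def assess_event(event_payload):
    prompt = f"""
You are a physical security analyst for a Tier III data center.
Analyze the following security event and determine:
1. Threat level (LOW / MEDIUM / HIGH / CRITICAL)
2. Recommended immediate action
3. Escalation required (YES / NO)
4. Operator alert text (2 sentences max)

Respond in JSON format only.

Event: {json.dumps(event_payload)}
Facility context: {FACILITY_CONTEXT}
Current time: {datetime.now().isoformat()}
Recent events (last 15 min): {recent_events()}
"""
    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=500,
        messages=[{'role': 'user', 'content': prompt}]
    )
    text = response.content[0].text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Non-JSON reply: surface it for human review rather than crash
        return {'parse_error': True, 'raw_response': text}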
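
Because the reply drives alerting, it is worth validating before anything acts on it. The sketch below extracts the JSON object from a reply and checks it against the four items the prompt asks for. The key names are assumptions (the prompt above does not name them), so state them explicitly in your prompt if you adopt this pattern.

```python
import json

THREAT_LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")
# Assumed key names; spell them out in the prompt so replies match.
REQUIRED_KEYS = ("threat_level", "recommended_action",
                 "escalation_required", "alert_text")


def parse_assessment(raw_text):
    """Extract and validate the JSON object in a model reply.

    Raises ValueError when the reply is unusable, so the caller can route
    the original event to a human instead of silently dropping it.
    """
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    level = str(data["threat_level"]).upper()
    if level not in THREAT_LEVELS:
        raise ValueError(f"unknown threat level: {level}")
    data["threat_level"] = level
    # Normalise YES/NO (or a JSON boolean) to a Python bool
    answer = str(data["escalation_required"]).strip().upper()
    data["escalation_required"] = answer in ("YES", "TRUE")
    return data
```

Treating a malformed reply as an error rather than a LOW alert is a deliberate choice: in a security pipeline, an unparseable assessment should fail toward human attention.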

FACILITY_CONTEXT Is Your Differentiator

The FACILITY_CONTEXT variable is where your system becomes differentiated from any commercial platform. This is where you encode your zone classifications, access schedules, personnel rosters, threat model, and operational rules. A commercial platform cannot know this. You do.
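
One way to keep that context maintainable is to hold it as structured data and serialise it into the prompt. The sketch below is illustrative only: every zone name, schedule, and rule is a made-up placeholder, and the `is_staffed` helper is a hypothetical convenience for pre-checks outside the LLM.

```python
import json
from datetime import time

# Hypothetical facility data; every zone name, hour, and rule is illustrative.
FACILITY = {
    "zones": {
        "server_hall_a": {"classification": "restricted", "badge_required": True,
                          "staffed_hours": ["07:00", "19:00"]},
        "loading_dock": {"classification": "controlled", "badge_required": True,
                         "staffed_hours": ["06:00", "18:00"]},
        "lobby": {"classification": "public", "badge_required": False,
                  "staffed_hours": ["00:00", "23:59"]},
    },
    "rules": [
        "Any person in a restricted zone outside staffed hours is at least HIGH.",
        "A person in a badge-required zone with no recent badge event is at least HIGH.",
    ],
}


def is_staffed(zone, at):
    """True if `at` (a datetime.time) falls within the zone's staffed hours."""
    start, end = (time.fromisoformat(t)
                  for t in FACILITY["zones"][zone]["staffed_hours"])
    return start <= at <= end


# Serialise once at startup and interpolate into every prompt
FACILITY_CONTEXT = json.dumps(FACILITY, indent=2)
```

Keeping the source of truth as data rather than free text also gives you something to version-control and show an auditor.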

Multi-Event Correlation Example

A camera detects a person in a server hall at 2:14am. Access control shows no badge event for that person in the preceding 10 minutes. A second camera detected motion in the adjacent corridor at 2:11am. The reasoning layer correlates these three data points and generates a CRITICAL alert with a recommended lockdown response. A single-event system generates three separate LOW alerts.

Multi-Event Correlation — Rolling Buffer Pattern
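# Maintain a rolling buffer of recent events
# Include it in every prompt to enable sequence reasoning
from collections import deque
from datetime import datetime, timedelta

# Last 50 events; each is a dict with a datetime 'timestamp'.
# Size maxlen so the buffer still covers the full window at peak event rates.
event_buffer = deque(maxlen=50)

def recent_events(minutes=15):
    cutoff = datetime.now() - timedelta(minutes=minutes)
    return [e for e in event_buffer
            if e['timestamp'] > cutoff]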

# Single-event alerting is a commodity.
# Multi-event correlation is where LLMs create value.
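
Some correlations are cheap to compute deterministically before the LLM is called, which gives the reasoning layer a pre-digested signal to weigh. The sketch below flags person detections in a zone with no badge event for that zone in the preceding window, as in the server hall example. It matches on zone rather than on individual identity, since a camera detection alone does not say who the person is. The function and field names are hypothetical.

```python
from datetime import datetime, timedelta


def unbadged_detections(detections, badge_events, window_minutes=10):
    """Return person detections with no badge event for the same zone
    in the preceding window.

    Both inputs are lists of dicts with a 'zone' key and a datetime
    'timestamp'; detections also carry a 'label' (assumed field names).
    """
    window = timedelta(minutes=window_minutes)
    flagged = []
    for det in detections:
        if det.get("label") != "person":
            continue
        badged = any(
            b["zone"] == det["zone"]
            and timedelta(0) <= det["timestamp"] - b["timestamp"] <= window
            for b in badge_events
        )
        if not badged:
            flagged.append(det)
    return flagged
```

The flagged list can be added to the event payload so the prompt states the missing badge event outright instead of relying on the model to notice its absence.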

Layer 4

Operator Interface

The operator interface is the human layer of your system. It needs to deliver real-time alerts with enough context for the operator to act without switching tools, log all events with timestamps for post-incident review, and allow operators to query the system in plain language during an incident.

Component	Technology	Purpose
Frontend	React	Operator dashboard and real-time alert feed
Backend	Node.js or Python	Bridge between MQTT event stream, CV layer, and Claude API
Real-time transport	WebSocket	Backend to frontend real-time alert delivery
Event logging	PostgreSQL or SQLite	Event log, audit trail, post-incident review
Access control integration	ACS API (optional)	Badge events alongside camera alerts in operator view
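
For the event logging row, a minimal append-only log can be sketched with the standard library's `sqlite3` module. The table and column names here are assumptions, not a prescribed schema; a PostgreSQL deployment would follow the same shape.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_event_log(path=":memory:"):
    """Create the log table if needed and return a connection."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS event_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            logged_at TEXT NOT NULL,   -- UTC, ISO 8601
            kind TEXT NOT NULL,        -- detection, assessment, operator_action
            actor TEXT NOT NULL,       -- component name or operator ID
            payload TEXT NOT NULL      -- JSON body of the record
        )""")
    return conn


def log_event(conn, kind, actor, payload):
    """Append one record and return its row ID; this module never updates rows."""
    logged_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO event_log (logged_at, kind, actor, payload) VALUES (?, ?, ?, ?)",
        (logged_at, kind, actor, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid
```

Logging detections, AI assessments, and operator actions to the same table with a `kind` column keeps the post-incident timeline in one ordered query.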

The Conversational Query Interface

An operator who can type "show me all motion events in Zone 4 in the last two hours" or "what happened before the door alarm on Camera 12" and receive an intelligent natural-language response has a fundamentally different capability than one working through a traditional VMS interface. This is a significant differentiator that traditional VMS products do not deliver out of the box.
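
A common way to back such queries is to retrieve the matching events first and then hand only those to the LLM along with the question. The sketch below shows that retrieval and prompt-building step; the event field names and the prompt wording are hypothetical, and turning the operator's free text into the filter arguments is left out.

```python
import json
from datetime import datetime, timedelta


def select_events(events, zone=None, label=None, since=None):
    """Filter logged events before handing them to the LLM.

    Each event is a dict with 'zone', 'label', and a datetime 'timestamp'
    (assumed field names).
    """
    selected = []
    for e in events:
        if zone is not None and e["zone"] != zone:
            continue
        if label is not None and e["label"] != label:
            continue
        if since is not None and e["timestamp"] < since:
            continue
        selected.append(e)
    return sorted(selected, key=lambda e: e["timestamp"])


def build_query_prompt(question, events):
    """Ground the operator's question in the retrieved events only."""
    lines = [json.dumps(e, default=str) for e in events]  # datetimes become strings
    return (
        "Answer the operator's question using only the events listed.\n"
        f"Question: {question}\n"
        "Events:\n" + "\n".join(lines)
    )
```

Restricting the model to retrieved records keeps answers traceable to the event log, which matters when the response is later used in an incident review.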

What the Interface Generates

Timestamped event logs of every detection and AI assessment. Structured logs of every human action, alert acknowledgment, and escalation decision. Query history for post-incident review. All of this is auditor-ready evidence for SOC 2 and ISO 27001 compliance programs.

Infrastructure Requirements

Hardware Requirements

The hardware requirements are modest relative to the capability delivered. A proof of concept on a small camera count can be built for $800–$1,500 in hardware.

Component	Minimum Spec	Recommended Spec	Purpose
Server / Mini PC	Intel i7, 16GB RAM	Intel i9 or Xeon, 32GB RAM	Frigate NVR host, backend services
GPU	NVIDIA RTX 3060 12GB	NVIDIA RTX 4070 or better	YOLO inference (if running local CV). Not required if using cloud CV APIs.
Storage	2TB NVMe	4–8TB NVMe RAID	Video retention
Network	1Gbps NIC	2.5Gbps or 10Gbps NIC	Camera stream ingestion
Google Coral TPU	$60–80 USB or PCIe	PCIe version preferred	Frigate hardware acceleration — no GPU needed for basic detection. Excellent PoC option.

For a proof of concept on 4–8 cameras, a consumer-grade PC with a Coral TPU handles Frigate detection without a GPU. GPU becomes necessary when running local YOLO inference at scale or training custom models.
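
Storage is the spec most worth sanity-checking against your own cameras. The function below is a rough sizing sketch for continuous recording. The 4 Mbps per-camera bitrate in the usage line is an assumption for illustration, not a figure from this guide; substitute the measured bitrate of your streams.

```python
def retention_days(storage_tb, cameras, bitrate_mbps):
    """Days of continuous recording that fit in the given storage.

    storage_tb is in decimal terabytes; bitrate_mbps is the average
    per-camera stream bitrate in megabits per second.
    """
    bytes_per_day_per_camera = bitrate_mbps * 1_000_000 / 8 * 86_400
    total_bytes = storage_tb * 1_000_000_000_000
    return total_bytes / (bytes_per_day_per_camera * cameras)


# 2 TB minimum spec, 8 cameras, assumed 4 Mbps each
print(round(retention_days(2, 8, 4), 1))  # 5.8
```

Under those assumptions the minimum-spec 2TB drive holds under a week of continuous footage, so event-triggered recording or the larger recommended storage is worth planning for early.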

PoC Budget Reality Check

Hardware: $800–$1,500. Cloud CV API costs for event-driven (not continuous-stream) workload: under $50/month at PoC scale. Claude API costs for discrete event reasoning calls at typical guard post event rates: under $20/month. Total monthly operating cost at PoC scale is under $75.

Scale-Up Economics

Per-camera licensing for commercial platforms typically runs $50–200/camera/year. At 500 cameras, that is $25,000–$100,000 annually in licensing alone, before implementation or support costs. A custom architecture replaces that recurring fee with infrastructure you own: beyond the initial hardware investment, the marginal cost of each added camera is mostly storage and inference capacity rather than licensing.
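
The licensing arithmetic is simple enough to parameterise for your own camera count. The first function below reproduces the range quoted above; the break-even helper is an illustrative addition, and its inputs (build cost and annual running cost) are figures you would have to estimate for your own environment.

```python
def annual_licensing(cameras, low_per_camera=50, high_per_camera=200):
    """Return the (low, high) annual licensing cost in dollars."""
    return cameras * low_per_camera, cameras * high_per_camera


def breakeven_years(build_cost, annual_run_cost, annual_license_cost):
    """Years until cumulative licensing exceeds build plus running costs.

    Returns None when the owned system costs more per year to run than
    the license it replaces, in which case it never breaks even.
    """
    annual_saving = annual_license_cost - annual_run_cost
    if annual_saving <= 0:
        return None
    return build_cost / annual_saving


low, high = annual_licensing(500)
print(low, high)  # 25000 100000
```

Running cost should include engineering time for maintenance, which is the part of the build option that is easiest to underestimate.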

Implementation Sequence

Implementation Roadmap

Six phases from PoC to production. Each phase delivers a working artifact before the next begins — no big-bang implementation.

Phase	Timeline	Scope	Deliverable
1	Week 1–2	Frigate NVR running on lab hardware, 2–4 cameras, MQTT broker operational. Validate camera integration and event stream output.	Working VMS with live event stream
2	Week 2–4	AWS Rekognition or Frigate built-in detector processing detection events. Validate structured detection JSON publishing to MQTT. Test object classification accuracy on your specific cameras.	Structured detection events on MQTT
3	Week 3–5	Claude API receiving event payloads and generating structured threat assessments. Build FACILITY_CONTEXT with zone classifications and access schedules. Test multi-event correlation with simulated events.	AI-assessed alerts with recommended actions
4	Week 5–10	React dashboard showing real-time alerts, event log, and conversational query interface. WebSocket connection for real-time delivery. PostgreSQL event logging and audit trail.	Working operator SOC interface
5	Week 8–12	ACS API feeding badge events into reasoning layer context. Correlate physical access events with camera detections. Test insider threat scenario correlation.	Correlated video + access events
6	Week 10–14	Authentication, network segmentation, logging hardening, disaster recovery plan. SOC 2 evidence documentation. Model accuracy validation and test harness.	Production-ready architecture

Compliance Implications

SOC 2 Implications

If you are building this for a data center client operating under SOC 2, the architecture has direct compliance implications — both advantages and considerations that require attention from the start.

Compliance Advantages

Continuous monitoring replaces periodic evidence collection. Your system generates timestamped event logs continuously, satisfying the 2026 SOC 2 emphasis on real-time operational visibility.
AI system governance is now a SOC 2 consideration. A custom-built system where you own the AI model selection, prompt logic, and access controls is easier to document and audit than a black-box commercial platform.
Access review evidence. The operator interface generates structured logs of every human action, alert acknowledgment, and escalation decision — giving auditors exactly the structured and timestamped evidence the 2026 criteria require.

Compliance Considerations

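Third-party risk. If you use cloud CV APIs (AWS Rekognition, Google Video Intelligence), those providers become subservice organizations within your SOC 2 scope. The same applies to the hosted LLM API used by the reasoning layer. Document them accordingly from day one.
Data residency. If video frames are being sent to cloud APIs, understand where that data is processed and retained. For sensitive data center environments, local CV inference (YOLO on-premises) keeps video frames on site. Structured event data sent to a hosted LLM API still leaves the facility, so scope FACILITY_CONTEXT and event payloads with that in mind.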
Model validation. The 2026 SOC 2 AI criteria require documentation of accuracy checks and validation processes for AI systems. Build a test harness early that measures your model's false positive and false negative rates against your specific camera views.
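
A test harness of that kind can start very small. The sketch below computes false positive and false negative rates from labelled samples; it assumes you have a set of frames or clips from your own cameras, each hand-labelled for whether the condition of interest is present. The function name and output keys are illustrative.

```python
def detection_rates(samples):
    """Compute error rates from (actual, predicted) boolean pairs.

    One pair per labelled frame or clip. Rates are None when the
    denominator is zero (for example, no negative samples at all).
    """
    tp = fp = tn = fn = 0
    for actual, predicted in samples:
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "false_positive_rate": ratio(fp, fp + tn),  # false alarms per negative
        "false_negative_rate": ratio(fn, fn + tp),  # misses per positive
        "precision": ratio(tp, tp + fp),
        "samples": tp + fp + tn + fn,
    }
```

Re-running the same labelled set after every model, threshold, or prompt change gives you a dated record of validation results to file as evidence.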

Conclusion

The approach described in this guide (Frigate for video management, YOLO or cloud CV for object detection, Claude or ChatGPT for event reasoning, and a custom operator interface) produces a system that can match or exceed commercial platforms in the areas that matter most to data center and enterprise security operators: contextual reasoning, facility-specific logic, and data sovereignty. The strategic value for security consultants and operators is not just the technology. It is the ability to demonstrate, with a working system, what AI-augmented guard services actually look like in practice, and to own that capability rather than resell someone else's.

About the Author

Paul Jankowski is the founder and principal consultant of CoreBastion Security Consulting, specializing in data center physical security, critical national infrastructure protection, and enterprise risk. He is a U.S. Air Force veteran with over 25 years of experience in physical security and most recently served as Senior Manager of Data Center Physical Security at Amazon Web Services. He holds the IDCA Data Center Infrastructure Specialist (DCIS) certification and serves on the IDCA Technical Standards Committee. His published work addresses drone threat assessment, substation and fiber vulnerabilities, defense in depth, and data center resiliency. linkedin.com/in/pauljankowski