SDD + GitOps Documentation Framework
This document defines the documentation framework for the NATSBridge project. It establishes a structured approach to creating, maintaining, and evolving technical documentation in line with GitOps principles, so that documentation is versioned, auditable, and continuously validated alongside the codebase.
The SDD Framework: Seven Pillars of Documentation
| Document | Purpose (Rationale) | Primary Audience | Format / Content | Example (SaaS Context) | Measurement (KPI) |
|---|---|---|---|---|---|
| Requirements | Capture the business intent — why we're building this and what success looks like. Defines boundaries and user-visible outcomes. | Stakeholders, Product Owners, Lead Developers | User stories, PRDs, acceptance criteria, non-functional constraints. | "System must process tabular data from Julia to SvelteKit UI with <200ms latency for 5-member teams." | 95% of requests complete <200ms (synthetic monitoring). |
| Specification | The technical contract — precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test. | Developers, QA Engineers, CI/CD pipelines | OpenAPI, Protobuf, AsyncAPI. Endpoint definitions, schemas, error codes. | contract.yaml defining a NATS subject that accepts Arrow streams with snake_case headers. | 100% of messages validated against spec (CI block rate). |
| Architecture | The blueprint — how components fit together, interact, and scale. Guides system structure and trade-offs. | Architects, Senior Developers, DevOps | C4 diagrams, Mermaid.js, component/network/storage models. | Diagram showing 6-node cluster routing traffic via Caddy → Node.js API → Julia pods. | 100% of major decisions logged with trade-off analysis. |
| Walkthrough | The story of flow — shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs. | New Developers, Team Members | TOUR.md, Loom videos, sequence diagrams. Step-by-step traces with rationale. | "UI sends JSON → Node.js wraps Claim-Check → Julia pulls Arrow data (prevents NATS overflow)." | New developers ship feature in <2 days (PR timeline). |
| Implementation | The real code — business logic, helpers, tests, configs. Where design becomes executable. | Developers, Code Reviewers | Source code, README.md, unit tests, setup scripts. | Julia function for matrix calculation + SvelteKit component rendering table. | >80% unit test coverage, <5% drift from spec. |
| Validation | The enforcer — ensures implementation matches the spec. Blocks drift and human error. | Automation servers, QA, Lead Developers | CI jobs, contract tests, linting, integration checks. | CI job rejects PR with camelCase field not allowed by YAML spec. | <1% of PRs bypass validation gates. |
| Runbook | The operational manual — how the system lives in production, scales, and recovers. Guides on-call engineers. | DevOps, SREs, On-call Developers | K8s manifests, Helm charts, Markdown guides. Deployment, scaling, backup/restore, troubleshooting. | GitOps manifest ensuring 6 Julia replicas restart if memory >80%. | MTTR <15 minutes for P1 incidents. |
Detailed Document Descriptions
1. Requirements
Purpose: Capture the business intent — why we're building this and what success looks like. Defines boundaries and user-visible outcomes.
Why It Matters:
- Aligns engineering efforts with business goals
- Provides a north star for feature development
- Establishes acceptance criteria before implementation begins
- Creates a contract between product and engineering
Content Guidelines:
- User stories with clear acceptance criteria (As a X, I want Y so that Z)
- Product Requirements Documents (PRDs) with success metrics
- Non-functional requirements (performance, security, scalability)
- Boundary definitions (what's in scope vs. out of scope)
Best Practices:
- Link each requirement to a measurable KPI
- Keep requirements testable and verifiable
- Maintain backward compatibility with existing requirements
- Review and update requirements as business context changes
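Linking a requirement to a measurable KPI can be sketched as a small check. The example below is only an illustration: it assumes latency samples (in milliseconds) have already been collected by some monitoring job, and the function names `fraction_within` and `meets_latency_kpi` are hypothetical. It tests the "95% of requests complete <200ms" target from the table above.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Return the fraction of samples strictly below threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples provided")
    within = sum(1 for value in latencies_ms if value < threshold_ms)
    return within / len(latencies_ms)


def meets_latency_kpi(latencies_ms, threshold_ms=200.0, target=0.95):
    """True when at least `target` of the samples finish under threshold_ms.

    The defaults mirror the example KPI (95% under 200 ms); both are
    assumptions that a real PRD would state explicitly.
    """
    return fraction_within(latencies_ms, threshold_ms) >= target
```

Keeping the threshold and target as parameters lets the same check be reused when the requirement is revised.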
2. Specification
Purpose: The technical contract — precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test.
Why It Matters:
- Prevents implementation drift between components
- Enables contract testing in CI/CD pipelines
- Provides a single source of truth for data structures
- Facilitates integration between teams
Content Guidelines:
- API endpoint definitions (methods, paths, parameters)
- Request/response schemas (JSON, XML, Protobuf, AsyncAPI)
- Error codes and their meanings
- Data validation rules and constraints
- Rate limiting and quota definitions
Best Practices:
- Use formal specification languages (OpenAPI 3.0+, AsyncAPI)
- Version specifications alongside code
- Generate client SDKs from specifications
- Block CI on specification violations
- Document edge cases and error scenarios
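To make "block CI on specification violations" concrete, here is a minimal sketch of contract checking. It deliberately uses a hand-written dictionary rather than real OpenAPI or AsyncAPI tooling, and the field names (`request_id`, `row_count`, `source_node`) are hypothetical, not taken from the NATSBridge contract.

```python
# Hypothetical mini-contract: field name -> expected Python type.
CONTRACT = {
    "required": {"request_id": str, "row_count": int},
    "optional": {"source_node": str},
}


def validate_message(message, contract=CONTRACT):
    """Return a list of human-readable violations (empty means valid)."""
    errors = []
    required = contract["required"]
    optional = contract.get("optional", {})

    # Every required field must be present.
    for field in required:
        if field not in message:
            errors.append(f"missing required field: {field}")

    # Every present field must be known and have the expected type.
    for field, value in message.items():
        expected = required.get(field, optional.get(field))
        if expected is None:
            errors.append(f"unexpected field: {field}")
        elif not isinstance(value, expected) or (
            isinstance(value, bool) and expected is not bool
        ):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors
```

Returning a list of violations, rather than raising on the first one, lets a CI job report every problem in a single run.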
3. Architecture
Purpose: The blueprint — how components fit together, interact, and scale. Guides system structure and trade-offs.
Why It Matters:
- Provides a mental model for system design
- Guides technical decision-making and trade-off analysis
- Facilitates onboarding of new architects and senior developers
- Documents scaling and performance considerations
Content Guidelines:
- C4 diagrams (Context, Container, Component levels)
- Mermaid.js flowcharts and sequence diagrams
- Component interaction diagrams
- Network topology and data flow
- Storage and caching strategies
- Scaling and resilience patterns
Best Practices:
- Use diagrams that are easy to update (Mermaid.js over static images)
- Document trade-off decisions with Rationale Documents
- Include scaling considerations for each component
- Document failure modes and recovery strategies
- Keep architecture diagrams versioned with code
4. Walkthrough
Purpose: The story of flow — shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs.
Why It Matters:
- Reduces onboarding time for new developers
- Provides context that code comments alone cannot convey
- Explains the "why" behind architectural decisions
- Helps identify gaps in the system design
Content Guidelines:
- Step-by-step flow descriptions with rationale
- Sequence diagrams showing request/response patterns
- "Tour of the codebase" guides
- Video walkthroughs (Loom, internal recordings)
- Debugging and tracing examples
Best Practices:
- Walk through real user journeys, not just technical flows
- Include "what could go wrong" scenarios
- Link walkthroughs to relevant code locations
- Keep walkthroughs updated with architecture changes
- Make walkthroughs interactive where possible
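The example flow in the overview table mentions a Claim-Check step that keeps large payloads off the message bus. A walkthrough could illustrate that idea with something like the sketch below. This is not the NATSBridge implementation: the in-memory store, the envelope fields, and the `max_inline_bytes` threshold are all assumptions made for illustration.

```python
import uuid


class InMemoryStore:
    """Stand-in for an external blob store (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = uuid.uuid4().hex
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def wrap(payload: bytes, store, max_inline_bytes=1024):
    """Send small payloads inline; replace large ones with a claim check."""
    if len(payload) <= max_inline_bytes:
        return {"kind": "inline", "data": payload}
    return {"kind": "claim_check", "key": store.put(payload), "size": len(payload)}


def unwrap(envelope, store):
    """Recover the original payload on the consumer side."""
    if envelope["kind"] == "inline":
        return envelope["data"]
    return store.get(envelope["key"])
```

The producer calls `wrap` before publishing and the consumer calls `unwrap` after receiving, so only a short reference travels through the broker when the payload is large.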
5. Implementation
Purpose: The real code — business logic, helpers, tests, configs. Where design becomes executable.
Why It Matters:
- This is the actual artifact that runs in production
- Code is the ultimate source of truth (when it matches spec)
- Tests validate correctness and prevent regressions
- Configuration files define runtime behavior
Content Guidelines:
- Business logic implementation
- Helper functions and utilities
- Unit and integration tests
- Configuration files (YAML, JSON, environment)
- Setup and development scripts
- Code organization and module structure
Best Practices:
- Follow consistent code style and conventions
- Write tests before or alongside implementation (TDD/BDD)
- Document complex logic with inline comments
- Keep configuration externalized and versioned
- Use type annotations where applicable
6. Validation
Purpose: The enforcer — ensures implementation matches the spec. Blocks drift and human error.
Why It Matters:
- Prevents breaking changes from reaching production
- Catches specification violations early in the CI pipeline
- Maintains data integrity and API consistency
- Reduces manual QA effort through automation
Content Guidelines:
- CI/CD pipeline configurations
- Contract testing scripts
- Linting rules and configurations
- Integration test suites
- Schema validation jobs
- Security scanning and audit jobs
Best Practices:
- Fail CI on specification violations
- Run validation jobs on every commit and PR
- Use automated code review tools
- Maintain a validation job health dashboard
- Document validation failure remediation steps
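The overview table gives the example of a CI job rejecting a camelCase field. A minimal sketch of such a gate follows, assuming the payload has already been parsed into Python dictionaries and lists. The function names and the exact snake_case rule are assumptions, not part of the project's actual pipeline.

```python
import re

# Assumed rule: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def find_non_snake_case(payload, path=""):
    """Recursively collect dotted paths of keys that are not snake_case."""
    violations = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            key_path = f"{path}.{key}" if path else str(key)
            if not SNAKE_CASE.fullmatch(str(key)):
                violations.append(key_path)
            violations.extend(find_non_snake_case(value, key_path))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            violations.extend(find_non_snake_case(item, f"{path}[{index}]"))
    return violations


def exit_code_for(payload):
    """Return 1 when any violation exists, so a CI wrapper can fail the job."""
    return 1 if find_non_snake_case(payload) else 0
```

A CI wrapper script would print the violation paths and pass `exit_code_for(...)` to `sys.exit`, which is what turns the check into a blocking gate.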
7. Runbook
Purpose: The operational manual — how the system lives in production, scales, and recovers. Guides on-call engineers.
Why It Matters:
- Reduces Mean Time To Recovery (MTTR) for incidents
- Provides step-by-step guidance for common issues
- Documents scaling and deployment procedures
- Ensures operational knowledge is not siloed
Content Guidelines:
- Deployment procedures (manual and automated)
- Scaling instructions (horizontal/vertical)
- Backup and restore procedures
- Troubleshooting guides for common issues
- Runbook entries for specific error codes
- Contact information and escalation paths
Best Practices:
- Write runbooks for every P1/P2 incident
- Include exact commands and configuration snippets
- Test runbooks periodically (chaos engineering)
- Link runbook entries to relevant documentation
- Keep runbooks updated whenever the system changes
How to Use This Approach Effectively
1. Start with Requirements
Before writing any code or documentation, establish clear requirements. Ask:
- What business problem are we solving?
- How will we measure success?
- What are the non-negotiable constraints?
Action: Create a docs/requirements/ directory and start with PRD.md and KPIs.md.
2. Define the Specification First
Once requirements are stable, define the technical specification. This becomes the contract for implementation.
Action: Create docs/specification/ with contract.yaml (or appropriate format) and error-codes.md.
3. Design the Architecture
With requirements and specification in place, design the architecture. Document trade-off decisions explicitly.
Action: Create docs/architecture/ with Mermaid diagrams and trade-offs.md.
4. Create Walkthroughs Early
As soon as the architecture is defined, create walkthroughs. This helps identify gaps and provides onboarding material.
Action: Create docs/walkthrough/ with TOUR.md and sequence diagrams.
5. Implement with Validation in Mind
Write implementation code that adheres to the specification. Build validation into the CI pipeline from day one.
Action: Ensure test files are co-located with implementation and run on every commit.
6. Automate Validation
Build automated validation that runs in CI/CD. This ensures spec compliance and prevents drift.
Action: Configure CI jobs to validate against specification and block PRs on violations.
7. Document Operations from Day One
Create runbook entries as soon as deployment procedures are established. Update them when incidents occur.
Action: Create docs/runbook/ with entries for deployment, scaling, and common issues.
GitOps Integration
This documentation framework aligns with GitOps principles:
| GitOps Principle | Documentation Alignment |
|---|---|
| Versioned | All documentation lives in git, with history and audit trail |
| Declarative | Specifications and architecture are declarative contracts |
| Automated | Validation jobs automate spec compliance checks |
| Self-Service | Walkthroughs and runbooks enable self-service onboarding and operations |
| Observability | KPIs and metrics are defined for each documentation artifact |
Git Structure:
docs/
├── requirements/ # PRDs, user stories, KPIs
├── specification/ # OpenAPI, Protobuf, AsyncAPI specs
├── architecture/ # C4 diagrams, Mermaid, trade-off docs
├── walkthrough/ # TOUR.md, sequence diagrams
├── implementation/ # Source code (in src/)
├── validation/ # CI configs, test suites
└── runbook/ # Deployment, scaling, troubleshooting
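One way to keep this layout from eroding is a small check that can run in CI. The sketch below assumes the seven directory names shown in the tree above; the function name `missing_doc_dirs` is hypothetical.

```python
from pathlib import Path

# Directory names taken from the tree above.
REQUIRED_DIRS = (
    "requirements",
    "specification",
    "architecture",
    "walkthrough",
    "implementation",
    "validation",
    "runbook",
)


def missing_doc_dirs(docs_root):
    """Return the required subdirectories that do not exist under docs_root."""
    root = Path(docs_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
```

Calling `missing_doc_dirs("docs")` from the repository root returns an empty list when the structure is complete.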
Metrics and Continuous Improvement
Each documentation artifact has associated KPIs. Track these to ensure quality:
| Document | KPI | Target |
|---|---|---|
| Requirements | Requirement coverage | 100% of features have associated requirements |
| Specification | Spec compliance rate | 100% of messages validate against spec |
| Architecture | Decision documentation | 100% of major decisions logged with trade-offs |
| Walkthrough | New dev time-to-first-PR | <2 days from onboarding to first contribution |
| Implementation | Test coverage | >80% unit test coverage |
| Validation | Bypass rate | <1% of PRs bypass validation gates |
| Runbook | MTTR | <15 minutes for P1 incidents |
Review Cadence:
- Weekly: Review KPI dashboards and documentation gaps
- Monthly: Update documentation based on incident learnings
- Quarterly: Full framework review and improvement
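As an illustration of tracking one of these KPIs, the sketch below computes MTTR from incident records and compares it with the "<15 minutes" target. It assumes each incident is available as a `(detected_at, resolved_at)` pair of `datetime` values. That record shape and the function names are assumptions, not an existing NATSBridge data model.

```python
from datetime import timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved_at - detected_at) over all incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("no incidents provided")
    return sum(durations, timedelta()) / len(durations)


def meets_mttr_target(incidents, target=timedelta(minutes=15)):
    """True when the mean recovery time is strictly below the target."""
    return mean_time_to_recovery(incidents) < target
```

The same pattern (compute a value, compare it with a stated target) applies to the other rows of the table.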
Template Examples
Requirements Template
# PRD: Feature Name
## Business Goal
[What problem are we solving?]
## Success Metrics
- [Metric 1]: Target [value]
- [Metric 2]: Target [value]
## User Stories
- As a [role], I want [feature] so that [benefit]
- Acceptance Criteria: [details]
## Non-Functional Requirements
- Performance: [details]
- Security: [details]
- Scalability: [details]
## Out of Scope
- [What's explicitly excluded]
Specification Template
# contract.yaml
openapi: 3.0.0
info:
  title: NATSBridge API
  version: 1.0.0
paths:
  /api/v1/endpoint:
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Request'
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Response'
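The template refers to `#/components/schemas/Request` and `#/components/schemas/Response`, which a complete contract would need to define. A validation job could catch unresolved references with a check along these lines. The sketch assumes the YAML has already been parsed into nested dictionaries by a YAML parser (not shown) and handles only local `#/...` references; the function names are hypothetical.

```python
def collect_refs(node):
    """Yield every '$ref' string found anywhere in the parsed document."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from collect_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_refs(item)


def resolves(document, ref):
    """True when a local JSON Pointer style reference points at an existing node."""
    if not ref.startswith("#/"):
        return False  # external references are out of scope for this sketch
    node = document
    for part in ref[2:].split("/"):
        # Undo JSON Pointer escaping: "~1" is "/", "~0" is "~".
        part = part.replace("~1", "/").replace("~0", "~")
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def dangling_refs(document):
    """Return the sorted set of references that do not resolve."""
    return sorted({ref for ref in collect_refs(document) if not resolves(document, ref)})
```

Run against the template as written, this would report both schema references until a `components` section is added.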
Architecture Template
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#3b82f6'}}}%%
flowchart TD
A[Client] --> B[Caddy]
B --> C[Node.js API]
C --> D[Julia Worker]
D --> E[NATS Cluster]
E --> F[Storage]
style A fill:#f9f9f9,stroke:#333
style E fill:#e0e7ff,stroke:#3b82f6
Runbook Template
# Runbook: Service Restart
**Severity**: P2
**Estimated Time**: 5 minutes
## Symptoms
- Service is unresponsive
- Health checks are failing
## Steps
1. SSH to the host
2. Run: `kubectl rollout restart deployment/natsbridge`
3. Monitor: `kubectl get pods -l app=natsbridge -w`
## Rollback
- Run: `kubectl rollout undo deployment/natsbridge`
## Post-Incident
- [ ] Review logs for root cause
- [ ] Update runbook if needed
Conclusion
This SDD + GitOps Documentation Framework ensures that documentation is:
- Structured: Seven distinct artifacts with clear purposes
- Automated: Validation and CI/CD integration
- Versioned: All documentation in git with history
- Measurable: KPIs for quality and effectiveness
- Actionable: Practical templates and examples
Use this framework as a living document: update it as your team's needs evolve.