# SDD + GitOps Documentation Framework

This document defines the documentation framework for the NATSBridge project. It establishes a structured approach to creating, maintaining, and evolving technical documentation in alignment with GitOps principles, ensuring that documentation is versioned, auditable, and continuously validated alongside the codebase.

---

## The SDD Framework: Seven Pillars of Documentation

| Document | Purpose (Rationale) | Primary Audience | Format / Content | Example (SaaS Context) | Measurement (KPI) |
|----------|---------------------|-----------------|------------------|------------------------|-------------------|
| **Requirements** | Capture the **business intent**: why we're building this and what success looks like. Defines boundaries and user-visible outcomes. | Stakeholders, Product Owners, Lead Developers | User stories, PRDs, acceptance criteria, non-functional constraints. | "System must process tabular data from Julia to SvelteKit UI with <200ms latency for 5-member teams." | 95% of requests complete <200ms (synthetic monitoring). |
| **Specification** | The **technical contract**: precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test. | Developers, QA Engineers, CI/CD pipelines | OpenAPI, Protobuf, AsyncAPI. Endpoint definitions, schemas, error codes. | `contract.yaml` defining a NATS subject that accepts Arrow streams with snake_case headers. | 100% of messages validated against spec (CI block rate). |
| **Architecture** | The **blueprint**: how components fit together, interact, and scale. Guides system structure and trade-offs. | Architects, Senior Developers, DevOps | C4 diagrams, Mermaid.js, component/network/storage models. | Diagram showing 6-node cluster routing traffic via Caddy → Node.js API → Julia pods. | 100% of major decisions logged with trade-off analysis. |
| **Walkthrough** | The **story of flow**: shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs. | New Developers, Team Members | TOUR.md, Loom videos, sequence diagrams. Step-by-step traces with rationale. | "UI sends JSON → Node.js wraps Claim-Check → Julia pulls Arrow data (prevents NATS overflow)." | New developers ship feature in <2 days (PR timeline). |
| **Implementation** | The **real code**: business logic, helpers, tests, configs. Where design becomes executable. | Developers, Code Reviewers | Source code, README.md, unit tests, setup scripts. | Julia function for matrix calculation + SvelteKit component rendering table. | >80% unit test coverage, <5% drift from spec. |
| **Validation** | The **enforcer**: ensures implementation matches the spec. Blocks drift and human error. | Automation servers, QA, Lead Developers | CI jobs, contract tests, linting, integration checks. | CI job rejects PR with camelCase field not allowed by YAML spec. | <1% of PRs bypass validation gates. |
| **Runbook** | The **operational manual**: how the system lives in production, scales, and recovers. Guides on-call engineers. | DevOps, SREs, On-call Developers | K8s manifests, Helm charts, Markdown guides. Deployment, scaling, backup/restore, troubleshooting. | GitOps manifest ensuring 6 Julia replicas restart if memory >80%. | MTTR <15 minutes for P1 incidents. |

---

## Detailed Document Descriptions

### 1. Requirements

**Purpose**: Capture the *business intent*: why we're building this and what success looks like. Defines boundaries and user-visible outcomes.

**Why It Matters**:
- Aligns engineering efforts with business goals
- Provides a north star for feature development
- Establishes acceptance criteria before implementation begins
- Creates a contract between product and engineering

**Content Guidelines**:
- User stories with clear acceptance criteria (As a X, I want Y so that Z)
- Product Requirements Documents (PRDs) with success metrics
- Non-functional requirements (performance, security, scalability)
- Boundary definitions (what's in scope vs. out of scope)

**Best Practices**:
- Link each requirement to a measurable KPI
- Keep requirements testable and verifiable
- Maintain backward compatibility with existing requirements
- Review and update requirements as business context changes
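
Linking a requirement to a measurable KPI can be made concrete with a small check. The sketch below takes the "95% of requests complete <200ms" KPI from the table above and tests a list of latency samples against it. The function names, the nearest-rank percentile method, and the sample values are illustrative assumptions, not part of the NATSBridge codebase. Calling `meets_latency_kpi(samples)` returns `True` only when the 95th-percentile sample is below the target.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_kpi(latencies_ms, target_ms=200.0, pct=95):
    """True when the pct-th percentile latency is strictly below target_ms."""
    return percentile(latencies_ms, pct) < target_ms


# Hypothetical synthetic-monitoring samples in milliseconds (assumed data).
samples = [120, 135, 150, 160, 180, 190, 195, 198, 199, 450]
print(percentile(samples, 95), meets_latency_kpi(samples))
```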

---

### 2. Specification

**Purpose**: The *technical contract*, defining precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test.

**Why It Matters**:
- Prevents implementation drift between components
- Enables contract testing in CI/CD pipelines
- Provides a single source of truth for data structures
- Facilitates integration between teams

**Content Guidelines**:
- API endpoint definitions (methods, paths, parameters)
- Request/response schemas (JSON, XML, Protobuf, AsyncAPI)
- Error codes and their meanings
- Data validation rules and constraints
- Rate limiting and quota definitions

**Best Practices**:
- Use formal specification languages (OpenAPI 3.0+, AsyncAPI)
- Version specifications alongside code
- Generate client SDKs from specifications
- Block CI on specification violations
- Document edge cases and error scenarios
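
To illustrate what "a single source of truth for data structures" buys in practice, here is a minimal sketch of checking one message against a contract using only the standard library. The `CONTRACT` dictionary and its field names are hypothetical stand-ins for whatever `contract.yaml` actually defines; a real pipeline would more likely load the spec file and use a dedicated schema validator. `validate_message` returns a list of error strings, so an empty list means the message conforms.

```python
# Hypothetical contract: field names and types are assumptions for illustration.
CONTRACT = {
    "required": {"request_id": str, "row_count": int},
    "optional": {"trace_id": str},
}


def validate_message(message, contract=CONTRACT):
    """Return a list of human-readable contract violations (empty if valid)."""
    errors = []
    allowed = {**contract["required"], **contract["optional"]}

    # Every required field must be present.
    for name in contract["required"]:
        if name not in message:
            errors.append(f"missing required field: {name}")

    # Every present field must be known and have exactly the expected type.
    for name, value in message.items():
        if name not in allowed:
            errors.append(f"unknown field: {name}")
        elif type(value) is not allowed[name]:
            errors.append(
                f"field {name}: expected {allowed[name].__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


print(validate_message({"request_id": "abc", "row_count": 3}))
print(validate_message({"requestId": "abc", "row_count": "3"}))
```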

---

### 3. Architecture

**Purpose**: The *blueprint*, showing how components fit together, interact, and scale. Guides system structure and trade-offs.

**Why It Matters**:
- Provides a mental model for system design
- Guides technical decision-making and trade-off analysis
- Facilitates onboarding of new architects and senior developers
- Documents scaling and performance considerations

**Content Guidelines**:
- C4 diagrams (Context, Container, Component levels)
- Mermaid.js flowcharts and sequence diagrams
- Component interaction diagrams
- Network topology and data flow
- Storage and caching strategies
- Scaling and resilience patterns

**Best Practices**:
- Use diagrams that are easy to update (Mermaid.js over static images)
- Document trade-off decisions with Rationale Documents
- Include scaling considerations for each component
- Document failure modes and recovery strategies
- Keep architecture diagrams versioned with code

---

### 4. Walkthrough

**Purpose**: The *story of flow*, showing how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs.

**Why It Matters**:
- Reduces onboarding time for new developers
- Provides context that code comments alone cannot convey
- Explains the "why" behind architectural decisions
- Helps identify gaps in the system design

**Content Guidelines**:
- Step-by-step flow descriptions with rationale
- Sequence diagrams showing request/response patterns
- "Tour of the codebase" guides
- Video walkthroughs (Loom, internal recordings)
- Debugging and tracing examples

**Best Practices**:
- Walk through real user journeys, not just technical flows
- Include "what could go wrong" scenarios
- Link walkthroughs to relevant code locations
- Keep walkthroughs updated with architecture changes
- Make walkthroughs interactive where possible
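
A walkthrough often benefits from a runnable trace of the flow it narrates. The sketch below models the Claim-Check step from the framework table's example: a large payload is parked in a store and only a small reference travels in the message, which is the stated reason the pattern protects NATS from oversized messages. Everything here is an assumption for illustration: the in-memory store, the `MAX_INLINE_BYTES` threshold, and the envelope field names are hypothetical and do not describe the actual NATSBridge implementation. The producer calls `wrap`, and the consumer calls `unwrap` to get the original bytes back.

```python
import base64
import uuid

MAX_INLINE_BYTES = 1024  # hypothetical threshold for sending data inline


class InMemoryStore:
    """Stand-in for an object store; a real system would use external storage."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        key = str(uuid.uuid4())
        self._blobs[key] = data
        return key

    def get(self, key):
        return self._blobs[key]


def wrap(payload, store):
    """Producer side: send small payloads inline, large ones by reference."""
    if len(payload) <= MAX_INLINE_BYTES:
        return {"kind": "inline", "data": base64.b64encode(payload).decode("ascii")}
    return {"kind": "claim_check", "ref": store.put(payload), "size": len(payload)}


def unwrap(envelope, store):
    """Consumer side: resolve the envelope back into the original bytes."""
    if envelope["kind"] == "inline":
        return base64.b64decode(envelope["data"])
    return store.get(envelope["ref"])


store = InMemoryStore()
envelope = wrap(b"\x00" * 5000, store)
print(envelope["kind"], len(unwrap(envelope, store)))
```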

---

### 5. Implementation

**Purpose**: The *real code*, comprising business logic, helpers, tests, and configs. Where design becomes executable.

**Why It Matters**:
- This is the actual artifact that runs in production
- Code is the ultimate source of truth (when it matches spec)
- Tests validate correctness and prevent regressions
- Configuration files define runtime behavior

**Content Guidelines**:
- Business logic implementation
- Helper functions and utilities
- Unit and integration tests
- Configuration files (YAML, JSON, environment)
- Setup and development scripts
- Code organization and module structure

**Best Practices**:
- Follow consistent code style and conventions
- Write tests before or alongside implementation (TDD/BDD)
- Document complex logic with inline comments
- Keep configuration externalized and versioned
- Use type annotations where applicable

---

### 6. Validation

**Purpose**: The *enforcer*, ensuring that the implementation matches the spec. Blocks drift and human error.

**Why It Matters**:
- Prevents breaking changes from reaching production
- Catches specification violations early in the CI pipeline
- Maintains data integrity and API consistency
- Reduces manual QA effort through automation

**Content Guidelines**:
- CI/CD pipeline configurations
- Contract testing scripts
- Linting rules and configurations
- Integration test suites
- Schema validation jobs
- Security scanning and audit jobs

**Best Practices**:
- Fail CI on specification violations
- Run validation jobs on every commit and PR
- Use automated code review tools
- Maintain validation job health dashboard
- Document validation failure remediation steps
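
The framework table's example of a validation gate is a CI job that rejects a camelCase field. A minimal sketch of such a lint follows, assuming the payload has already been parsed into Python dictionaries and lists. The regular expression is one reasonable definition of snake_case, and the function name is hypothetical. A CI wrapper could exit non-zero whenever the returned list is non-empty, which is what blocks the PR.

```python
import re

# One possible definition of snake_case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_fields(obj, path=""):
    """Return dotted paths of every dictionary key that is not snake_case."""
    violations = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = f"{path}.{key}" if path else key
            if not SNAKE_CASE.match(key):
                violations.append(here)
            # Recurse so nested objects are checked as well.
            violations.extend(non_snake_case_fields(value, here))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            violations.extend(non_snake_case_fields(item, f"{path}[{index}]"))
    return violations


# Hypothetical message with two offending keys.
message = {"row_count": 3, "userId": 7, "meta": {"traceId": "x", "ok_flag": True}}
print(non_snake_case_fields(message))
```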

---

### 7. Runbook

**Purpose**: The *operational manual*, describing how the system lives in production, scales, and recovers. Guides on-call engineers.

**Why It Matters**:
- Reduces Mean Time To Recovery (MTTR) for incidents
- Provides step-by-step guidance for common issues
- Documents scaling and deployment procedures
- Ensures operational knowledge is not siloed

**Content Guidelines**:
- Deployment procedures (manual and automated)
- Scaling instructions (horizontal/vertical)
- Backup and restore procedures
- Troubleshooting guides for common issues
- Runbook entries for specific error codes
- Contact information and escalation paths

**Best Practices**:
- Write runbooks for every P1/P2 incident
- Include exact commands and configuration snippets
- Test runbooks periodically (chaos engineering)
- Link runbook entries to relevant documentation
- Keep runbooks updated when system changes
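
Since the runbook's KPI is MTTR, it helps to pin down how that number is computed. The sketch below averages the time between detection and resolution over a list of incidents and compares it with the 15-minute P1 target from the framework table. The incident tuple layout and the function names are assumptions for illustration; real incident data would come from whatever tracker the team uses.

```python
from datetime import datetime, timedelta

MTTR_TARGET = timedelta(minutes=15)  # P1 target taken from the KPI table


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def meets_mttr_target(incidents, target=MTTR_TARGET):
    """True when the mean recovery time is strictly below the target."""
    return mean_time_to_recovery(incidents) < target


# Hypothetical P1 incidents (assumed data).
start = datetime(2024, 1, 1, 12, 0)
incidents = [
    (start, start + timedelta(minutes=10)),
    (start, start + timedelta(minutes=16)),
]
print(mean_time_to_recovery(incidents), meets_mttr_target(incidents))
```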

---

## How to Use This Approach Effectively

### 1. Start with Requirements

Before writing any code or documentation, establish clear requirements. Ask:
- What business problem are we solving?
- How will we measure success?
- What are the non-negotiable constraints?

**Action**: Create a `docs/requirements/` directory and start with `PRD.md` and `KPIs.md`.

### 2. Define the Specification First

Once requirements are stable, define the technical specification. This becomes the contract for implementation.

**Action**: Create `docs/specification/` with `contract.yaml` (or appropriate format) and `error-codes.md`.

### 3. Design the Architecture

With requirements and specification in place, design the architecture. Document trade-off decisions explicitly.

**Action**: Create `docs/architecture/` with Mermaid diagrams and `trade-offs.md`.

### 4. Create Walkthroughs Early

As soon as the architecture is defined, create walkthroughs. This helps identify gaps and provides onboarding material.

**Action**: Create `docs/walkthrough/` with `TOUR.md` and sequence diagrams.

### 5. Implement with Validation in Mind

Write implementation code that adheres to the specification. Build validation into the CI pipeline from day one.

**Action**: Ensure test files are co-located with implementation and run on every commit.

### 6. Automate Validation

Build automated validation that runs in CI/CD. This ensures spec compliance and prevents drift.

**Action**: Configure CI jobs to validate against specification and block PRs on violations.

### 7. Document Operations from Day One

Create runbook entries as soon as deployment procedures are established. Update them when incidents occur.

**Action**: Create `docs/runbook/` with entries for deployment, scaling, and common issues.

---

## GitOps Integration

This documentation framework aligns with GitOps principles:

| GitOps Principle | Documentation Alignment |
|-----------------|------------------------|
| **Versioned** | All documentation lives in git, with history and audit trail |
| **Declarative** | Specifications and architecture are declarative contracts |
| **Automated** | Validation jobs automate spec compliance checks |
| **Self-Service** | Walkthroughs and runbooks enable self-service onboarding and operations |
| **Observability** | KPIs and metrics are defined for each documentation artifact |

**Git Structure**:
```
docs/
├── requirements/      # PRDs, user stories, KPIs
├── specification/     # OpenAPI, Protobuf, AsyncAPI specs
├── architecture/      # C4 diagrams, Mermaid, trade-off docs
├── walkthrough/       # TOUR.md, sequence diagrams
├── implementation/    # Source code (in src/)
├── validation/        # CI configs, test suites
└── runbook/           # Deployment, scaling, troubleshooting
```
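
Because the layout above is itself a convention, it can be checked automatically like any other contract. The following sketch reports which of the seven expected directories are missing under `docs/`. The function name and the idea of running it as a CI step are assumptions, not an existing NATSBridge job. An empty result means the repository matches the structure shown.

```python
from pathlib import Path

# The seven directories from the Git structure above.
EXPECTED_DIRS = (
    "requirements",
    "specification",
    "architecture",
    "walkthrough",
    "implementation",
    "validation",
    "runbook",
)


def missing_doc_dirs(repo_root):
    """Return the expected docs/ subdirectories that do not exist under repo_root."""
    docs = Path(repo_root) / "docs"
    return [name for name in EXPECTED_DIRS if not (docs / name).is_dir()]


# Check the current working directory as an example repository root.
print(missing_doc_dirs("."))
```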

---

## Metrics and Continuous Improvement

Each documentation artifact has associated KPIs. Track these to ensure quality:

| Document | KPI | Target |
|----------|-----|--------|
| Requirements | Requirement coverage | 100% of features have associated requirements |
| Specification | Spec compliance rate | 100% of messages validate against spec |
| Architecture | Decision documentation | 100% of major decisions logged with trade-offs |
| Walkthrough | New dev time-to-first-PR | <2 days from onboarding to first contribution |
| Implementation | Test coverage | >80% unit test coverage |
| Validation | Bypass rate | <1% of PRs bypass validation gates |
| Runbook | MTTR | <15 minutes for P1 incidents |

**Review Cadence**:
- Weekly: Review KPI dashboards and documentation gaps
- Monthly: Update documentation based on incident learnings
- Quarterly: Full framework review and improvement
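
For the weekly KPI review, the numeric targets in the table can be encoded once and compared against measured values. The sketch below covers four of the targets. The metric names and the shape of the `measured` dictionary are hypothetical, and how the measurements are collected is out of scope here. `failing_kpis` returns the names of metrics that are missing or off target, so an empty list means every tracked KPI is met.

```python
import operator

# Targets from the KPI table; metric names are hypothetical identifiers.
TARGETS = {
    "spec_compliance_pct": (operator.ge, 100.0),    # 100% of messages validate
    "unit_test_coverage_pct": (operator.gt, 80.0),  # >80% unit test coverage
    "validation_bypass_pct": (operator.lt, 1.0),    # <1% of PRs bypass gates
    "p1_mttr_minutes": (operator.lt, 15.0),         # <15 minutes for P1 incidents
}


def failing_kpis(measured):
    """Return sorted names of KPIs that are missing or do not meet their target."""
    return sorted(
        name
        for name, (compare, target) in TARGETS.items()
        if name not in measured or not compare(measured[name], target)
    )


# Hypothetical weekly measurements (assumed data).
measured = {
    "spec_compliance_pct": 100.0,
    "unit_test_coverage_pct": 80.0,
    "validation_bypass_pct": 0.5,
    "p1_mttr_minutes": 12.0,
}
print(failing_kpis(measured))
```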

---

## Template Examples

### Requirements Template
```markdown
# PRD: Feature Name

## Business Goal
[What problem are we solving?]

## Success Metrics
- [Metric 1]: Target [value]
- [Metric 2]: Target [value]

## User Stories
- As a [role], I want [feature] so that [benefit]
- Acceptance Criteria: [details]

## Non-Functional Requirements
- Performance: [details]
- Security: [details]
- Scalability: [details]

## Out of Scope
- [What's explicitly excluded]
```

### Specification Template
```yaml
# contract.yaml
openapi: 3.0.0
info:
  title: NATSBridge API
  version: 1.0.0
paths:
  /api/v1/endpoint:
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Request'
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Response'
components:
  schemas:
    # Placeholder schemas so the $ref targets above resolve; define real fields here.
    Request:
      type: object
    Response:
      type: object
```

### Architecture Template
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#3b82f6'}}}%%
flowchart TD
    A[Client] --> B[Caddy]
    B --> C[Node.js API]
    C --> D[Julia Worker]
    D --> E[NATS Cluster]
    E --> F[Storage]

    style A fill:#f9f9f9,stroke:#333
    style E fill:#e0e7ff,stroke:#3b82f6
```

### Runbook Template
```markdown
# Runbook: Service Restart

**Severity**: P2
**Estimated Time**: 5 minutes

## Symptoms
- Service is unresponsive
- Health checks are failing

## Steps
1. SSH to the host
2. Run: `kubectl rollout restart deployment/natsbridge`
3. Monitor: `kubectl get pods -l app=natsbridge -w`

## Rollback
- Run: `kubectl rollout undo deployment/natsbridge`

## Post-Incident
- [ ] Review logs for root cause
- [ ] Update runbook if needed
```

---

## Conclusion

This SDD + GitOps Documentation Framework ensures that documentation is:
- **Structured**: Seven distinct artifacts with clear purposes
- **Automated**: Validation and CI/CD integration
- **Versioned**: All documentation in git with history
- **Measurable**: KPIs for quality and effectiveness
- **Actionable**: Practical templates and examples

Use this framework as a living document and update it as your team's needs evolve.