# SDD + GitOps Documentation Framework

This document defines the documentation framework for the NATSBridge project. It establishes a structured approach to creating, maintaining, and evolving technical documentation in alignment with GitOps principles, so that documentation is versioned, auditable, and continuously validated alongside the codebase.

---

## The SDD Framework: Seven Pillars of Documentation

| Document | Purpose (Rationale) | Primary Audience | Format / Content | Example (SaaS Context) | Measurement (KPI) |
|----------|---------------------|------------------|------------------|------------------------|-------------------|
| **Requirements** | Capture the **business intent**: why we're building this and what success looks like. Defines boundaries and user-visible outcomes. | Stakeholders, Product Owners, Lead Developers | User stories, PRDs, acceptance criteria, non-functional constraints. | "System must process tabular data from Julia to SvelteKit UI with <200ms latency for 5-member teams." | 95% of requests complete <200ms (synthetic monitoring). |
| **Specification** | The **technical contract**: precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test. | Developers, QA Engineers, CI/CD pipelines | OpenAPI, Protobuf, AsyncAPI. Endpoint definitions, schemas, error codes. | `contract.yaml` defining a NATS subject that accepts Arrow streams with snake_case headers. | 100% of messages validated against spec (CI block rate). |
| **Architecture** | The **blueprint**: how components fit together, interact, and scale. Guides system structure and trade-offs. | Architects, Senior Developers, DevOps | C4 diagrams, Mermaid.js, component/network/storage models. | Diagram showing 6-node cluster routing traffic via Caddy → Node.js API → Julia pods. | 100% of major decisions logged with trade-off analysis. |
| **Walkthrough** | The **story of flow**: shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs. | New Developers, Team Members | TOUR.md, Loom videos, sequence diagrams. Step-by-step traces with rationale. | "UI sends JSON → Node.js wraps Claim-Check → Julia pulls Arrow data (prevents NATS overflow)." | New developers ship feature in <2 days (PR timeline). |
| **Implementation** | The **real code**: business logic, helpers, tests, configs. Where design becomes executable. | Developers, Code Reviewers | Source code, README.md, unit tests, setup scripts. | Julia function for matrix calculation + SvelteKit component rendering table. | >80% unit test coverage, <5% drift from spec. |
| **Validation** | The **enforcer**: ensures implementation matches the spec. Blocks drift and human error. | Automation servers, QA, Lead Developers | CI jobs, contract tests, linting, integration checks. | CI job rejects PR with camelCase field not allowed by YAML spec. | <1% of PRs bypass validation gates. |
| **Runbook** | The **operational manual**: how the system lives in production, scales, and recovers. Guides on-call engineers. | DevOps, SREs, On-call Developers | K8s manifests, Helm charts, Markdown guides. Deployment, scaling, backup/restore, troubleshooting. | GitOps manifest ensuring 6 Julia replicas restart if memory >80%. | MTTR <15 minutes for P1 incidents. |

---

## Detailed Document Descriptions

### 1. Requirements

**Purpose**: Capture the *business intent*: why we're building this and what success looks like. Defines boundaries and user-visible outcomes.


**Why It Matters**:
- Aligns engineering efforts with business goals
- Provides a north star for feature development
- Establishes acceptance criteria before implementation begins
- Creates a contract between product and engineering

**Content Guidelines**:
- User stories with clear acceptance criteria (As a X, I want Y so that Z)
- Product Requirements Documents (PRDs) with success metrics
- Non-functional requirements (performance, security, scalability)
- Boundary definitions (what's in scope vs. out of scope)

**Best Practices**:
- Link each requirement to a measurable KPI
- Keep requirements testable and verifiable
- Maintain backward compatibility with existing requirements
- Review and update requirements as business context changes

---

### 2. Specification

**Purpose**: The *technical contract*: precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test.

**Why It Matters**:
- Prevents implementation drift between components
- Enables contract testing in CI/CD pipelines
- Provides a single source of truth for data structures
- Facilitates integration between teams

**Content Guidelines**:
- API endpoint definitions (methods, paths, parameters)
- Request/response schemas (JSON, XML, Protobuf, AsyncAPI)
- Error codes and their meanings
- Data validation rules and constraints
- Rate limiting and quota definitions

**Best Practices**:
- Use formal specification languages (OpenAPI 3.0+, AsyncAPI)
- Version specifications alongside code
- Generate client SDKs from specifications
- Block CI on specification violations
- Document edge cases and error scenarios

---

### 3. Architecture

**Purpose**: The *blueprint*: how components fit together, interact, and scale. Guides system structure and trade-offs.


**Why It Matters**:
- Provides a mental model for system design
- Guides technical decision-making and trade-off analysis
- Facilitates onboarding of new architects and senior developers
- Documents scaling and performance considerations

**Content Guidelines**:
- C4 diagrams (Context, Container, Component levels)
- Mermaid.js flowcharts and sequence diagrams
- Component interaction diagrams
- Network topology and data flow
- Storage and caching strategies
- Scaling and resilience patterns

**Best Practices**:
- Use diagrams that are easy to update (Mermaid.js over static images)
- Document trade-off decisions with Rationale Documents
- Include scaling considerations for each component
- Document failure modes and recovery strategies
- Keep architecture diagrams versioned with code

---

### 4. Walkthrough

**Purpose**: The *story of flow*: shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs.

**Why It Matters**:
- Reduces onboarding time for new developers
- Provides context that code comments alone cannot convey
- Explains the "why" behind architectural decisions
- Helps identify gaps in the system design

**Content Guidelines**:
- Step-by-step flow descriptions with rationale
- Sequence diagrams showing request/response patterns
- "Tour of the codebase" guides
- Video walkthroughs (Loom, internal recordings)
- Debugging and tracing examples

**Best Practices**:
- Walk through real user journeys, not just technical flows
- Include "what could go wrong" scenarios
- Link walkthroughs to relevant code locations
- Keep walkthroughs updated with architecture changes
- Make walkthroughs interactive where possible

---

### 5. Implementation

**Purpose**: The *real code*: business logic, helpers, tests, configs. Where design becomes executable.


**Why It Matters**:
- This is the actual artifact that runs in production
- Code is the ultimate source of truth (when it matches spec)
- Tests validate correctness and prevent regressions
- Configuration files define runtime behavior

**Content Guidelines**:
- Business logic implementation
- Helper functions and utilities
- Unit and integration tests
- Configuration files (YAML, JSON, environment)
- Setup and development scripts
- Code organization and module structure

**Best Practices**:
- Follow consistent code style and conventions
- Write tests before or alongside implementation (TDD/BDD)
- Document complex logic with inline comments
- Keep configuration externalized and versioned
- Use type annotations where applicable

---

### 6. Validation

**Purpose**: The *enforcer*: ensures implementation matches the spec. Blocks drift and human error.

**Why It Matters**:
- Prevents breaking changes from reaching production
- Catches specification violations early in the CI pipeline
- Maintains data integrity and API consistency
- Reduces manual QA effort through automation

**Content Guidelines**:
- CI/CD pipeline configurations
- Contract testing scripts
- Linting rules and configurations
- Integration test suites
- Schema validation jobs
- Security scanning and audit jobs

**Best Practices**:
- Fail CI on specification violations
- Run validation jobs on every commit and PR
- Use automated code review tools
- Maintain a validation job health dashboard
- Document validation failure remediation steps

---

### 7. Runbook

**Purpose**: The *operational manual*: how the system lives in production, scales, and recovers. Guides on-call engineers.


**Why It Matters**:
- Reduces Mean Time To Recovery (MTTR) for incidents
- Provides step-by-step guidance for common issues
- Documents scaling and deployment procedures
- Ensures operational knowledge is not siloed

**Content Guidelines**:
- Deployment procedures (manual and automated)
- Scaling instructions (horizontal/vertical)
- Backup and restore procedures
- Troubleshooting guides for common issues
- Runbook entries for specific error codes
- Contact information and escalation paths

**Best Practices**:
- Write runbooks for every P1/P2 incident
- Include exact commands and configuration snippets
- Test runbooks periodically (chaos engineering)
- Link runbook entries to relevant documentation
- Keep runbooks updated when the system changes

---

## How to Use This Approach Effectively

### 1. Start with Requirements

Before writing any code or documentation, establish clear requirements. Ask:
- What business problem are we solving?
- How will we measure success?
- What are the non-negotiable constraints?

**Action**: Create a `docs/requirements/` directory and start with `PRD.md` and `KPIs.md`.

### 2. Define the Specification First

Once requirements are stable, define the technical specification. This becomes the contract for implementation.

**Action**: Create `docs/specification/` with `contract.yaml` (or appropriate format) and `error-codes.md`.

### 3. Design the Architecture

With requirements and specification in place, design the architecture. Document trade-off decisions explicitly.

**Action**: Create `docs/architecture/` with Mermaid diagrams and `trade-offs.md`.

### 4. Create Walkthroughs Early

As soon as the architecture is defined, create walkthroughs. This helps identify gaps and provides onboarding material.

**Action**: Create `docs/walkthrough/` with `TOUR.md` and sequence diagrams.

### 5. Implement with Validation in Mind

Write implementation code that adheres to the specification. Build validation into the CI pipeline from day one.

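
As an illustration of building validation in from day one, here is a minimal Python sketch of a spec check that could run in CI. It assumes the spec's naming rule is snake_case field names, as in the camelCase example from the pillars table. The names `find_spec_violations` and `check_message` and the regular expression are hypothetical illustrations, not part of NATSBridge.

```python
import re

# Assumption: the spec requires snake_case field names. This pattern is a
# hypothetical stand-in for whatever rule contract.yaml actually defines.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def find_spec_violations(field_names):
    """Return the field names that are not snake_case, in input order."""
    return [name for name in field_names if not SNAKE_CASE.fullmatch(name)]


def check_message(field_names):
    """Raise ValueError if any field name violates the naming rule."""
    violations = find_spec_violations(field_names)
    if violations:
        raise ValueError(f"non-snake_case fields: {', '.join(violations)}")


if __name__ == "__main__":
    print(find_spec_violations(["user_id", "createdAt"]))  # prints ['createdAt']
```

A CI job could call `check_message` on sample payloads and fail the build when it raises, which is one way to block a PR on a specification violation.
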

**Action**: Ensure test files are co-located with implementation and run on every commit.

### 6. Automate Validation

Build automated validation that runs in CI/CD. This ensures spec compliance and prevents drift.

**Action**: Configure CI jobs to validate against specification and block PRs on violations.

### 7. Document Operations from Day One

Create runbook entries as soon as deployment procedures are established. Update them when incidents occur.

**Action**: Create `docs/runbook/` with entries for deployment, scaling, and common issues.

---

## GitOps Integration

This documentation framework aligns with GitOps principles:

| GitOps Principle | Documentation Alignment |
|------------------|-------------------------|
| **Versioned** | All documentation lives in git, with history and audit trail |
| **Declarative** | Specifications and architecture are declarative contracts |
| **Automated** | Validation jobs automate spec compliance checks |
| **Self-Service** | Walkthroughs and runbooks enable self-service onboarding and operations |
| **Observability** | KPIs and metrics are defined for each documentation artifact |

**Git Structure**:

```
docs/
├── requirements/      # PRDs, user stories, KPIs
├── specification/     # OpenAPI, Protobuf, AsyncAPI specs
├── architecture/      # C4 diagrams, Mermaid, trade-off docs
├── walkthrough/       # TOUR.md, sequence diagrams
├── implementation/    # Source code (in src/)
├── validation/        # CI configs, test suites
└── runbook/           # Deployment, scaling, troubleshooting
```

---

## Metrics and Continuous Improvement

Each documentation artifact has associated KPIs.
Track these to ensure quality:

| Document | KPI | Target |
|----------|-----|--------|
| Requirements | Requirement coverage | 100% of features have associated requirements |
| Specification | Spec compliance rate | 100% of messages validate against spec |
| Architecture | Decision documentation | 100% of major decisions logged with trade-offs |
| Walkthrough | New dev time-to-first-PR | <2 days from onboarding to first contribution |
| Implementation | Test coverage | >80% unit test coverage |
| Validation | Bypass rate | <1% of PRs bypass validation gates |
| Runbook | MTTR | <15 minutes for P1 incidents |

**Review Cadence**:
- Weekly: Review KPI dashboards and documentation gaps
- Monthly: Update documentation based on incident learnings
- Quarterly: Full framework review and improvement

---

## Template Examples

### Requirements Template

```markdown
# PRD: Feature Name

## Business Goal
[What problem are we solving?]

## Success Metrics
- [Metric 1]: Target [value]
- [Metric 2]: Target [value]

## User Stories
- As a [role], I want [feature] so that [benefit]
  - Acceptance Criteria: [details]

## Non-Functional Requirements
- Performance: [details]
- Security: [details]
- Scalability: [details]

## Out of Scope
- [What's explicitly excluded]
```

### Specification Template

```yaml
# contract.yaml
openapi: 3.0.0
info:
  title: NATSBridge API
  version: 1.0.0
paths:
  /api/v1/endpoint:
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Request'
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Response'
# Placeholder schemas so the $ref targets above resolve; replace with real fields.
components:
  schemas:
    Request:
      type: object
    Response:
      type: object
```

### Architecture Template

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#3b82f6'}}}%%
flowchart TD
    A[Client] --> B[Caddy]
    B --> C[Node.js API]
    C --> D[Julia Worker]
    D --> E[NATS Cluster]
    E --> F[Storage]
    style A fill:#f9f9f9,stroke:#333
    style E fill:#e0e7ff,stroke:#3b82f6
```

### Runbook Template

```markdown
# Runbook: Service Restart

**Severity**: P2
**Estimated Time**: 5 minutes

## Symptoms
- Service is unresponsive
- Health checks are failing

## Steps
1. SSH to the host
2. Run: `kubectl rollout restart deployment/natsbridge`
3. Monitor: `kubectl get pods -l app=natsbridge -w`

## Rollback
- Run: `kubectl rollout undo deployment/natsbridge`

## Post-Incident
- [ ] Review logs for root cause
- [ ] Update runbook if needed
```

---

## Conclusion

This SDD + GitOps Documentation Framework ensures that documentation is:

- **Structured**: Seven distinct artifacts with clear purposes
- **Automated**: Validation and CI/CD integration
- **Versioned**: All documentation in git with history
- **Measurable**: KPIs for quality and effectiveness
- **Actionable**: Practical templates and examples

Use this framework as a living document: update it as your team's needs evolve.