NATSBridge/docs/SDD_FRAMEWORK.md
2026-03-13 13:15:01 +07:00

SDD + GitOps Documentation Framework

This document defines the documentation framework for the NATSBridge project. It establishes a structured approach to creating, maintaining, and evolving technical documentation in alignment with GitOps principles—ensuring that documentation is versioned, auditable, and continuously validated alongside the codebase.


The SDD Framework: Seven Pillars of Documentation

| Document | Purpose (Rationale) | Primary Audience | Format / Content | Example (SaaS Context) | Measurement (KPI) |
| --- | --- | --- | --- | --- | --- |
| Requirements | Capture the business intent: why we're building this and what success looks like. Defines boundaries and user-visible outcomes. | Stakeholders, Product Owners, Lead Developers | User stories, PRDs, acceptance criteria, non-functional constraints. | "System must process tabular data from Julia to SvelteKit UI with <200ms latency for 5-member teams." | 95% of requests complete <200ms (synthetic monitoring). |
| Specification | The technical contract: precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test. | Developers, QA Engineers, CI/CD pipelines | OpenAPI, Protobuf, AsyncAPI. Endpoint definitions, schemas, error codes. | contract.yaml defining a NATS subject that accepts Arrow streams with snake_case headers. | 100% of messages validated against spec (CI block rate). |
| Architecture | The blueprint: how components fit together, interact, and scale. Guides system structure and trade-offs. | Architects, Senior Developers, DevOps | C4 diagrams, Mermaid.js, component/network/storage models. | Diagram showing 6-node cluster routing traffic via Caddy → Node.js API → Julia pods. | 100% of major decisions logged with trade-off analysis. |
| Walkthrough | The story of flow: shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs. | New Developers, Team Members | TOUR.md, Loom videos, sequence diagrams. Step-by-step traces with rationale. | "UI sends JSON → Node.js wraps Claim-Check → Julia pulls Arrow data (prevents NATS overflow)." | New developers ship feature in <2 days (PR timeline). |
| Implementation | The real code: business logic, helpers, tests, configs. Where design becomes executable. | Developers, Code Reviewers | Source code, README.md, unit tests, setup scripts. | Julia function for matrix calculation + SvelteKit component rendering table. | >80% unit test coverage, <5% drift from spec. |
| Validation | The enforcer: ensures implementation matches the spec. Blocks drift and human error. | Automation servers, QA, Lead Developers | CI jobs, contract tests, linting, integration checks. | CI job rejects PR with camelCase field not allowed by YAML spec. | <1% of PRs bypass validation gates. |
| Runbook | The operational manual: how the system lives in production, scales, and recovers. Guides on-call engineers. | DevOps, SREs, On-call Developers | K8s manifests, Helm charts, Markdown guides. Deployment, scaling, backup/restore, troubleshooting. | GitOps manifest ensuring 6 Julia replicas restart if memory >80%. | MTTR <15 minutes for P1 incidents. |

Detailed Document Descriptions

1. Requirements

Purpose: Capture the business intent — why we're building this and what success looks like. Defines boundaries and user-visible outcomes.

Why It Matters:

  • Aligns engineering efforts with business goals
  • Provides a north star for feature development
  • Establishes acceptance criteria before implementation begins
  • Creates a contract between product and engineering

Content Guidelines:

  • User stories with clear acceptance criteria (As a X, I want Y so that Z)
  • Product Requirements Documents (PRDs) with success metrics
  • Non-functional requirements (performance, security, scalability)
  • Boundary definitions (what's in scope vs. out of scope)

Best Practices:

  • Link each requirement to a measurable KPI
  • Keep requirements testable and verifiable
  • Keep new requirements consistent with existing ones, or explicitly mark the ones they supersede
  • Review and update requirements as business context changes
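
Linking a requirement to a measurable KPI can be as small as one function. The sketch below assumes the latency requirement quoted in the overview (95% of requests under 200 ms) and a list of latency samples such as a synthetic monitor might export; the function names and sample values are hypothetical.

```python
def fraction_within_budget(latencies_ms, budget_ms=200.0):
    """Return the share of requests that completed strictly under the budget."""
    if not latencies_ms:
        raise ValueError("no latency samples provided")
    within = sum(1 for value in latencies_ms if value < budget_ms)
    return within / len(latencies_ms)


def meets_latency_kpi(latencies_ms, budget_ms=200.0, target=0.95):
    """True when at least `target` of the samples finish under `budget_ms`."""
    return fraction_within_budget(latencies_ms, budget_ms) >= target


# Hypothetical samples: four of the five are under 200 ms.
samples = [120.0, 150.0, 180.0, 90.0, 240.0]
print(fraction_within_budget(samples))
print(meets_latency_kpi(samples))
```

Keeping the budget and target as parameters means the same check can be reused if the requirement is later revised.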

2. Specification

Purpose: The technical contract — precise rules for inputs, outputs, and data shape. Ensures consistency across dev and test.

Why It Matters:

  • Prevents implementation drift between components
  • Enables contract testing in CI/CD pipelines
  • Provides a single source of truth for data structures
  • Facilitates integration between teams

Content Guidelines:

  • API endpoint definitions (methods, paths, parameters)
  • Request/response schemas (JSON, XML, Protobuf, AsyncAPI)
  • Error codes and their meanings
  • Data validation rules and constraints
  • Rate limiting and quota definitions

Best Practices:

  • Use formal specification languages (OpenAPI 3.0+, AsyncAPI)
  • Version specifications alongside code
  • Generate client SDKs from specifications
  • Block CI on specification violations
  • Document edge cases and error scenarios
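
To make "block CI on specification violations" concrete, here is a minimal sketch of a message check using only the standard library. The contract and its field names (`request_id`, `row_count`, `payload_ref`) are invented for illustration; a real project would more likely validate against the OpenAPI or AsyncAPI document itself.

```python
# Hypothetical contract: field name -> required Python type.
CONTRACT = {
    "request_id": str,
    "row_count": int,
    "payload_ref": str,
}


def validate_message(message, contract=CONTRACT):
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    for field, expected_type in contract.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    # Fields outside the contract are reported too, so drift is visible.
    for field in message:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors


print(validate_message({"request_id": "abc", "row_count": "12"}))
```

Returning every violation at once, rather than stopping at the first, tends to make CI output more useful to the author of a failing PR.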

3. Architecture

Purpose: The blueprint — how components fit together, interact, and scale. Guides system structure and trade-offs.

Why It Matters:

  • Provides a mental model for system design
  • Guides technical decision-making and trade-off analysis
  • Facilitates onboarding of new architects and senior developers
  • Documents scaling and performance considerations

Content Guidelines:

  • C4 diagrams (Context, Container, Component levels)
  • Mermaid.js flowcharts and sequence diagrams
  • Component interaction diagrams
  • Network topology and data flow
  • Storage and caching strategies
  • Scaling and resilience patterns

Best Practices:

  • Use diagrams that are easy to update (Mermaid.js over static images)
  • Document trade-off decisions with Rationale Documents
  • Include scaling considerations for each component
  • Document failure modes and recovery strategies
  • Keep architecture diagrams versioned with code

4. Walkthrough

Purpose: The story of flow — shows how pieces connect end-to-end and why steps are sequenced. Builds intuition for new devs.

Why It Matters:

  • Reduces onboarding time for new developers
  • Provides context that code comments alone cannot convey
  • Explains the "why" behind architectural decisions
  • Helps identify gaps in the system design

Content Guidelines:

  • Step-by-step flow descriptions with rationale
  • Sequence diagrams showing request/response patterns
  • "Tour of the codebase" guides
  • Video walkthroughs (Loom, internal recordings)
  • Debugging and tracing examples

Best Practices:

  • Walk through real user journeys, not just technical flows
  • Include "what could go wrong" scenarios
  • Link walkthroughs to relevant code locations
  • Keep walkthroughs updated with architecture changes
  • Make walkthroughs interactive where possible
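
A walkthrough often benefits from a tiny executable trace of the flow it describes. The sketch below illustrates the Claim-Check step cited as the Walkthrough example in the overview: a large payload is parked in a store and only a small reference travels in the message. The in-memory store, the envelope layout, and the 1024-byte threshold are all assumptions made for illustration and are not taken from NATSBridge itself.

```python
import base64
import json
import uuid

# Illustrative threshold only; it is not a NATS or NATSBridge setting.
INLINE_LIMIT_BYTES = 1024


class InMemoryStore:
    """Stand-in for whatever blob storage the real system uses."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        key = uuid.uuid4().hex
        self._blobs[key] = data
        return key

    def get(self, key):
        return self._blobs[key]


def wrap(payload, store):
    """Build a small envelope: inline for small payloads, a reference otherwise."""
    if len(payload) <= INLINE_LIMIT_BYTES:
        body = base64.b64encode(payload).decode("ascii")
        envelope = {"kind": "inline", "data": body}
    else:
        envelope = {"kind": "claim_check", "ref": store.put(payload)}
    return json.dumps(envelope).encode("utf-8")


def unwrap(message, store):
    """Recover the original payload from either envelope kind."""
    envelope = json.loads(message)
    if envelope["kind"] == "inline":
        return base64.b64decode(envelope["data"])
    return store.get(envelope["ref"])


store = InMemoryStore()
big = bytes(5000)
message = wrap(big, store)
print(len(message) < len(big), unwrap(message, store) == big)
```

The point for a new developer is that the message on the bus stays small regardless of how large the tabular payload grows.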

5. Implementation

Purpose: The real code — business logic, helpers, tests, configs. Where design becomes executable.

Why It Matters:

  • This is the actual artifact that runs in production
  • Code is the ultimate source of truth (when it matches spec)
  • Tests validate correctness and prevent regressions
  • Configuration files define runtime behavior

Content Guidelines:

  • Business logic implementation
  • Helper functions and utilities
  • Unit and integration tests
  • Configuration files (YAML, JSON, environment)
  • Setup and development scripts
  • Code organization and module structure

Best Practices:

  • Follow consistent code style and conventions
  • Write tests before or alongside implementation (TDD/BDD)
  • Document complex logic with inline comments
  • Keep configuration externalized and versioned
  • Use type annotations where applicable

6. Validation

Purpose: The enforcer — ensures implementation matches the spec. Blocks drift and human error.

Why It Matters:

  • Prevents breaking changes from reaching production
  • Catches specification violations early in the CI pipeline
  • Maintains data integrity and API consistency
  • Reduces manual QA effort through automation

Content Guidelines:

  • CI/CD pipeline configurations
  • Contract testing scripts
  • Linting rules and configurations
  • Integration test suites
  • Schema validation jobs
  • Security scanning and audit jobs

Best Practices:

  • Fail CI on specification violations
  • Run validation jobs on every commit and PR
  • Use automated code review tools
  • Maintain a dashboard that tracks the health of validation jobs
  • Document validation failure remediation steps
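
As one possible shape for a validation gate, the sketch below mirrors the example from the overview, where CI rejects a camelCase field. It assumes the field names have already been extracted from the changed files; the function names and the regular expression are illustrative rather than part of any existing NATSBridge tooling.

```python
import re
import sys

# Lowercase words separated by single underscores, e.g. "row_count".
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_non_snake_case(field_names):
    """Return the names that violate the snake_case rule, in input order."""
    return [name for name in field_names if not SNAKE_CASE.match(name)]


def gate(field_names):
    """Report offenders on stderr and return a CI-style exit code."""
    offenders = find_non_snake_case(field_names)
    for name in offenders:
        print(f"field '{name}' is not snake_case", file=sys.stderr)
    return 1 if offenders else 0


# Hypothetical field names pulled from a pull request.
exit_code = gate(["row_count", "rowCount", "request_id"])
print(exit_code)
```

In a CI job the returned value would typically be passed to `sys.exit`, so that a non-zero code blocks the merge.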

7. Runbook

Purpose: The operational manual — how the system lives in production, scales, and recovers. Guides on-call engineers.

Why It Matters:

  • Reduces Mean Time To Recovery (MTTR) for incidents
  • Provides step-by-step guidance for common issues
  • Documents scaling and deployment procedures
  • Ensures operational knowledge is not siloed

Content Guidelines:

  • Deployment procedures (manual and automated)
  • Scaling instructions (horizontal/vertical)
  • Backup and restore procedures
  • Troubleshooting guides for common issues
  • Runbook entries for specific error codes
  • Contact information and escalation paths

Best Practices:

  • Write runbooks for every P1/P2 incident
  • Include exact commands and configuration snippets
  • Test runbooks periodically (chaos engineering)
  • Link runbook entries to relevant documentation
  • Keep runbooks updated whenever the system changes

How to Use This Approach Effectively

1. Start with Requirements

Before writing any code or documentation, establish clear requirements. Ask:

  • What business problem are we solving?
  • How will we measure success?
  • What are the non-negotiable constraints?

Action: Create a docs/requirements/ directory and start with PRD.md and KPIs.md.

2. Define the Specification First

Once requirements are stable, define the technical specification. This becomes the contract for implementation.

Action: Create docs/specification/ with contract.yaml (or appropriate format) and error-codes.md.

3. Design the Architecture

With requirements and specification in place, design the architecture. Document trade-off decisions explicitly.

Action: Create docs/architecture/ with Mermaid diagrams and trade-offs.md.

4. Create Walkthroughs Early

As soon as the architecture is defined, create walkthroughs. This helps identify gaps and provides onboarding material.

Action: Create docs/walkthrough/ with TOUR.md and sequence diagrams.

5. Implement with Validation in Mind

Write implementation code that adheres to the specification. Build validation into the CI pipeline from day one.

Action: Co-locate test files with the implementation and run them on every commit.

6. Automate Validation

Build automated validation that runs in CI/CD. This ensures spec compliance and prevents drift.

Action: Configure CI jobs to validate against specification and block PRs on violations.

7. Document Operations from Day One

Create runbook entries as soon as deployment procedures are established. Update them when incidents occur.

Action: Create docs/runbook/ with entries for deployment, scaling, and common issues.


GitOps Integration

This documentation framework aligns with GitOps principles:

| GitOps Principle | Documentation Alignment |
| --- | --- |
| Versioned | All documentation lives in git, with history and audit trail |
| Declarative | Specifications and architecture are declarative contracts |
| Automated | Validation jobs automate spec compliance checks |
| Self-Service | Walkthroughs and runbooks enable self-service onboarding and operations |
| Observability | KPIs and metrics are defined for each documentation artifact |

Git Structure:

docs/
├── requirements/       # PRDs, user stories, KPIs
├── specification/      # OpenAPI, Protobuf, AsyncAPI specs
├── architecture/       # C4 diagrams, Mermaid, trade-off docs
├── walkthrough/        # TOUR.md, sequence diagrams
├── implementation/     # Source code (in src/)
├── validation/         # CI configs, test suites
└── runbook/            # Deployment, scaling, troubleshooting
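
A layout like the one above is easy to let drift, so a small check can run in CI. The sketch below assumes the seven directory names exactly as listed in the tree and demonstrates the check against a temporary directory instead of a real repository.

```python
import tempfile
from pathlib import Path

# Directory names copied from the tree above.
REQUIRED_DIRS = (
    "requirements",
    "specification",
    "architecture",
    "walkthrough",
    "implementation",
    "validation",
    "runbook",
)


def missing_doc_dirs(docs_root):
    """Return the required sub-directories that do not exist under docs_root."""
    root = Path(docs_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


# Demonstration against a throwaway directory with only three folders present.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    for name in ("requirements", "specification", "runbook"):
        (docs / name).mkdir(parents=True)
    print(missing_doc_dirs(docs))
    # prints ['architecture', 'walkthrough', 'implementation', 'validation']
```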

Metrics and Continuous Improvement

Each documentation artifact has associated KPIs. Track these to ensure quality:

| Document | KPI | Target |
| --- | --- | --- |
| Requirements | Requirement coverage | 100% of features have associated requirements |
| Specification | Spec compliance rate | 100% of messages validate against spec |
| Architecture | Decision documentation | 100% of major decisions logged with trade-offs |
| Walkthrough | New dev time-to-first-PR | <2 days from onboarding to first contribution |
| Implementation | Test coverage | >80% unit test coverage |
| Validation | Bypass rate | <1% of PRs bypass validation gates |
| Runbook | MTTR | <15 minutes for P1 incidents |
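
Two of these KPIs reduce to simple arithmetic, which the sketch below works through: MTTR is the mean of (resolved minus detected) across incidents, and the bypass rate is bypassed PRs divided by total PRs. The incident timestamps and PR counts are made-up inputs; where a team actually pulls them from is left open.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """incidents: iterable of (detected_at, resolved_at) datetime pairs."""
    pairs = list(incidents)
    if not pairs:
        raise ValueError("no incidents provided")
    total = sum((resolved - detected for detected, resolved in pairs), timedelta())
    return total / len(pairs)


def bypass_rate(total_prs, bypassed_prs):
    """Share of PRs that were merged without passing the validation gates."""
    if total_prs <= 0:
        raise ValueError("total_prs must be positive")
    return bypassed_prs / total_prs


# Hypothetical data: two P1 incidents lasting 10 and 20 minutes.
start = datetime(2026, 1, 1, 12, 0)
incidents = [
    (start, start + timedelta(minutes=10)),
    (start, start + timedelta(minutes=20)),
]
mttr = mean_time_to_recovery(incidents)
print(mttr, mttr < timedelta(minutes=15))
print(bypass_rate(200, 1) < 0.01)
```

With these inputs the mean is exactly 15 minutes, which does not satisfy a strict "<15 minutes" target; the boundary case is worth deciding explicitly when the KPI is defined.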

Review Cadence:

  • Weekly: Review KPI dashboards and documentation gaps
  • Monthly: Update documentation based on incident learnings
  • Quarterly: Full framework review and improvement

Template Examples

Requirements Template

# PRD: Feature Name

## Business Goal
[What problem are we solving?]

## Success Metrics
- [Metric 1]: Target [value]
- [Metric 2]: Target [value]

## User Stories
- As a [role], I want [feature] so that [benefit]
  - Acceptance Criteria: [details]

## Non-Functional Requirements
- Performance: [details]
- Security: [details]
- Scalability: [details]

## Out of Scope
- [What's explicitly excluded]

Specification Template

# contract.yaml
openapi: 3.0.0
info:
  title: NATSBridge API
  version: 1.0.0
paths:
  /api/v1/endpoint:
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Request'
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
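                $ref: '#/components/schemas/Response'
# Placeholder schemas so the $ref pointers above resolve; replace with real fields.
components:
  schemas:
    Request:
      type: object
    Response:
      type: object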
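
A contract that uses `$ref` pointers is only usable when every pointer resolves to a definition in the same document. The sketch below shows one way to check that, assuming the YAML has already been loaded into nested dicts and lists. It handles local references (those starting with `#/`) only and ignores JSON Pointer escape sequences; the sample document is a hand-written stand-in, not the real NATSBridge contract.

```python
def collect_refs(node):
    """Yield every '$ref' string found anywhere in a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from collect_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_refs(item)


def resolves(document, ref):
    """True when a local pointer such as '#/components/schemas/Request' exists."""
    if not ref.startswith("#/"):
        return False  # external references are out of scope for this sketch
    node = document
    for part in ref[2:].split("/"):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def unresolved_refs(document):
    """Return the sorted set of references that point at nothing."""
    refs = set(collect_refs(document))
    return sorted(ref for ref in refs if not resolves(document, ref))


# Stand-in document: 'Response' is referenced but never defined.
spec = {
    "paths": {
        "/api/v1/endpoint": {
            "post": {
                "requestBody": {"schema": {"$ref": "#/components/schemas/Request"}},
                "responses": {
                    "200": {"schema": {"$ref": "#/components/schemas/Response"}}
                },
            }
        }
    },
    "components": {"schemas": {"Request": {"type": "object"}}},
}
print(unresolved_refs(spec))
```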

Architecture Template

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#3b82f6'}}}%%
flowchart TD
    A[Client] --> B[Caddy]
    B --> C[Node.js API]
    C --> D[Julia Worker]
    D --> E[NATS Cluster]
    E --> F[Storage]
    
    style A fill:#f9f9f9,stroke:#333
    style E fill:#e0e7ff,stroke:#3b82f6

Runbook Template

# Runbook: Service Restart

**Severity**: P2
**Estimated Time**: 5 minutes

## Symptoms
- Service is unresponsive
- Health checks are failing

## Steps
1. Open a shell on a host that has `kubectl` access to the cluster
2. Run: `kubectl rollout restart deployment/natsbridge`
3. Monitor: `kubectl get pods -l app=natsbridge -w`

## Rollback
- Run: `kubectl rollout undo deployment/natsbridge`

## Post-Incident
- [ ] Review logs for root cause
- [ ] Update runbook if needed
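
Runbooks that follow a template can be linted for completeness. The sketch below assumes the four second-level headings used in the template above (Symptoms, Steps, Rollback, Post-Incident) are mandatory; that rule is an assumption for illustration rather than something the framework states.

```python
import re

# Section names taken from the runbook template above.
REQUIRED_SECTIONS = ("Symptoms", "Steps", "Rollback", "Post-Incident")

# Matches second-level Markdown headings only ("## Title"), not "#" or "###".
HEADING = re.compile(r"^##\s+(.+)$", flags=re.MULTILINE)


def missing_sections(markdown_text):
    """Return the required section names that have no '## ' heading."""
    found = {match.group(1).strip() for match in HEADING.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical runbook that lacks a rollback section.
runbook = """# Runbook: Service Restart

## Symptoms
- Service is unresponsive

## Steps
1. Restart the deployment

## Post-Incident
- [ ] Review logs for root cause
"""
print(missing_sections(runbook))  # prints ['Rollback']
```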

Conclusion

This SDD + GitOps Documentation Framework ensures that documentation is:

  • Structured: Seven distinct artifacts with clear purposes
  • Automated: Validation and CI/CD integration
  • Versioned: All documentation in git with history
  • Measurable: KPIs for quality and effectiveness
  • Actionable: Practical templates and examples

Use this framework as a living document—update it as your team's needs evolve.