NATSBridge/docs/SDD_FRAMEWORK.md
2026-03-13 09:15:47 +07:00

SDD + GitOps Documentation Framework

Overview

The SDD + GitOps Documentation Framework divides software development documentation into eight document types, each with its own purpose, audience, format, and measurement. This separation of concerns keeps technical work aligned with business outcomes.

This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and can be measured for effectiveness. It's designed to prevent common pitfalls like feature creep, communication gaps, and operational fragility.


The Documentation Matrix

| Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) |
| --- | --- | --- | --- | --- | --- |
| Requirements | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | KPI: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
| Spec | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | SLA/SLO: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A contract.yaml defining exactly how Julia sends Arrow data to Node.js. It forces user_id to be a UUID. |
| Architecture | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | Efficiency Metrics: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
| Walkthrough | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | Quality: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace": 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
| Implementation | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | Code Health: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
| Validation | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | Compliance: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
| Maintenance | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | Sustainability: System Longevity. Measured by "Package Age", "Security Vulnerabilities Found", and "Migration Success Rate". | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
| Runbook | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | Reliability: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |

Detailed Document Descriptions

1. Requirements

Purpose: Establish the Business North Star.

Why It Matters: Without clear requirements, teams drift into "feature creep": building things that don't solve the actual problem. This document anchors the project in business outcomes.

Key Elements:

  • User Stories: What the user needs to accomplish
  • Business Constraints: Budget, timeline, regulatory requirements
  • Competitive Context: What competitors do and how you differentiate
  • Success Metrics: Quantifiable goals that define "done"

Best Practices:

  • Keep it in a shared wiki (Notion, GitHub Wiki) for collaborative editing
  • Focus on outcomes, not solutions
  • Explicitly state what you will NOT build

2. Spec (Specification)

Purpose: Create a machine-readable technical contract.

Why It Matters: Communication gaps between services cause bugs. A strict, typed spec prevents these by being the Single Source of Truth.

Key Elements:

  • API Endpoints: All routes with HTTP methods
  • Data Types: Strict typing with validation rules
  • Error Codes: Comprehensive error response definitions
  • Naming Conventions: snake_case keys, consistent patterns

Best Practices:

  • Use OpenAPI (YAML/JSON) for REST APIs or Protobuf for gRPC
  • Automate generation of client/server code from the spec
  • Run contract tests against the spec in CI/CD
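
To make the idea of a strictly typed contract more concrete, here is a minimal sketch of the kind of check a spec implies, using the rule from the matrix that user_id must be a UUID. The function names `is_uuid` and `validate_payload` and the error messages are assumptions made for this illustration. A real project would more likely derive such checks from the OpenAPI or Protobuf definition than write them by hand.

```python
import uuid


def is_uuid(value: object) -> bool:
    """Return True when value is a string that parses as a UUID."""
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def validate_payload(payload: dict) -> list[str]:
    """Return contract violations; an empty list means the payload conforms."""
    errors = []
    if "user_id" not in payload:
        errors.append("missing required field: user_id")
    elif not is_uuid(payload["user_id"]):
        errors.append("user_id must be a UUID")
    return errors
```

Returning a list of violations, rather than raising on the first one, lets a caller report every problem in a payload at once.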

3. Architecture

Purpose: Visualize the system structure and data flow.

Why It Matters: Complex systems (like the 6-node cluster) need clear maps. Without them, teams can't identify bottlenecks or make informed decisions.

Key Elements:

  • System Context Diagram: Shows the system and its external dependencies
  • Database ERD: Entity-Relationship diagrams for data model
  • Network Security Policies: Firewall rules, service mesh configs
  • Infrastructure Maps: Cloud resources, scaling groups

Best Practices:

  • Use Mermaid.js for diagrams-as-code (versionable, diffable)
  • Update diagrams when architecture changes
  • Focus on data flow and decision points
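
One way to keep diagrams versionable is to generate the Mermaid text from data that lives alongside the code. The sketch below renders the data path named in the matrix (Caddy, Node.js, NATS, Julia) as a Mermaid flowchart. The helper `render_data_path` and the node IDs `n0`, `n1`, and so on are hypothetical names chosen for this example.

```python
def render_data_path(nodes: list[tuple[str, str]]) -> str:
    """Build a left-to-right Mermaid flowchart from ordered (name, role) pairs."""
    lines = ["flowchart LR"]
    # Declare one labeled node per component.
    for index, (name, role) in enumerate(nodes):
        lines.append(f'    n{index}["{name} ({role})"]')
    # Connect each component to the next one in the path.
    for index in range(len(nodes) - 1):
        lines.append(f"    n{index} --> n{index + 1}")
    return "\n".join(lines)


DATA_PATH = [
    ("Caddy", "Proxy"),
    ("Node.js", "API"),
    ("NATS", "Queue"),
    ("Julia", "Math Engine"),
]

diagram = render_data_path(DATA_PATH)
```

The resulting string can be pasted into a Mermaid code fence in a Markdown file, so a change to the path shows up as an ordinary text diff.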

4. Walkthrough

Purpose: Build a mental model through narrative.

Why It Matters: Code doesn't explain why. Walkthroughs capture the reasoning behind architectural trade-offs, making onboarding faster and reducing conceptual bugs.

Key Elements:

  • Step-by-step traces: End-to-end flow of user actions
  • Trade-off explanations: Why you chose option A over B
  • The Big Picture: How components fit together conceptually

Best Practices:

  • Write in a TOUR.md file or record Loom videos
  • Focus on intuition, not just mechanics
  • Include "Rationale" sections for each major decision
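
A walkthrough often benefits from a small runnable trace. The sketch below illustrates the end-to-end trace from the matrix (the API wraps the payload in a claim check and the consumer pulls it), assuming the conventional claim-check pattern: the large payload is stored out of band and only a small reference travels through the queue. The in-memory `object_store` and `queue` are stand-ins for real storage and NATS, and all names here are hypothetical.

```python
import json
import uuid

# In-memory stand-ins (assumptions): a dict for the payload store, a list for the queue.
object_store: dict[str, bytes] = {}
queue: list[str] = []


def publish_with_claim_check(payload: dict) -> str:
    """Store the full payload out of band and enqueue only a small reference."""
    claim_id = str(uuid.uuid4())
    object_store[claim_id] = json.dumps(payload).encode("utf-8")
    queue.append(json.dumps({"claim_id": claim_id}))
    return claim_id


def consume_next() -> dict:
    """Pop the next reference from the queue and fetch the payload it points to."""
    message = json.loads(queue.pop(0))
    return json.loads(object_store.pop(message["claim_id"]))
```

Rationale: the queued message stays a few dozen bytes no matter how large the payload is, which is the property that would keep queue memory flat.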

5. Implementation

Purpose: The functional reality - the actual code.

Why It Matters: This is what runs in production. In SDD, the "boring" parts (types, routes) are generated automatically from the Spec, so developers can focus on business logic.

Key Elements:

  • Business Logic: The unique value you provide
  • Unit Tests: Covering edge cases and error paths
  • README.md: Local environment setup instructions

Best Practices:

  • Generate boilerplate (types, routes) from the Spec
  • Maintain 90%+ test coverage
  • Keep README.md up-to-date for local development
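
To show what generating types from the Spec can look like in miniature, the sketch below builds a dataclass from a heavily simplified schema fragment. The schema shape, the `TYPE_MAP` table, and the name `ReportRequest` are assumptions for illustration. Real OpenAPI or Protobuf code generators handle far more (nesting, optional fields, formats).

```python
from dataclasses import make_dataclass

# Hypothetical, simplified spec fragment; not a real OpenAPI document.
SPEC_SCHEMA = {
    "name": "ReportRequest",
    "properties": {"user_id": "string", "row_count": "integer"},
}

# Maps the schema's type names to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": float, "boolean": bool}


def generate_type(schema: dict) -> type:
    """Create a dataclass whose fields mirror the schema's properties."""
    spec_fields = [
        (name, TYPE_MAP[kind]) for name, kind in schema["properties"].items()
    ]
    return make_dataclass(schema["name"], spec_fields)


ReportRequest = generate_type(SPEC_SCHEMA)
```

Because the type is derived from the schema, renaming a field in the spec renames it in the code on the next generation run instead of leaving the two to drift apart.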

6. Validation

Purpose: Automated quality gates.

Why It Matters: Human error happens. Validation layers catch mistakes before they reach production, preventing contract violations and security issues.

Key Elements:

  • Contract Tests: Verify implementation matches spec (Dredd, Prism)
  • Integration Tests: Test service-to-service interactions
  • Security Scans: SAST and SBOM-based dependency analysis on every PR

Best Practices:

  • Run validation on every pull request
  • Block merges on contract violations
  • Track build success rate as a KPI
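
As an illustration of a merge-blocking gate, the sketch below flags field names that are not snake_case and turns the result into a process exit code, mirroring the camelCase example from the matrix. The regular expression and the function names are assumptions for this example. Contract-testing tools such as Dredd or Prism work against the full spec rather than a name list.

```python
import re
import sys

# Lowercase words separated by single underscores, e.g. "user_id".
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_naming_violations(field_names: list[str]) -> list[str]:
    """Return the field names that are not snake_case, in input order."""
    return [name for name in field_names if not SNAKE_CASE.match(name)]


def gate(field_names: list[str]) -> int:
    """Return a process exit code: 0 lets the build pass, 1 blocks it."""
    violations = find_naming_violations(field_names)
    for name in violations:
        print(f"contract violation: '{name}' is not snake_case", file=sys.stderr)
    return 1 if violations else 0
```

In a CI job, a script would typically end with `sys.exit(gate(names))`, since a non-zero exit code is what most pipelines use to fail a step.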

7. Maintenance

Purpose: Guide for long-term health and evolution.

Why It Matters: Software decays. Without a maintenance plan, dependency upgrades become risky, secrets go unrotated, and technical debt piles up.

Key Elements:

  • Dependency Update Schedule: When and how to upgrade packages
  • Secret Rotation Steps: How to rotate credentials securely
  • DB Migration Logs: History of schema changes
  • Tech Debt "Graveyard": Documented technical debt with remediation plans

Best Practices:

  • Document the "how" for common maintenance tasks
  • Track package age and security vulnerabilities
  • Schedule regular tech debt reviews
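
The "Package Age" metric can be computed with simple date arithmetic. The sketch below assumes you already have a release date for each installed package. The package names, the dates, and the 365-day threshold are made-up values for illustration, not recommendations from this framework.

```python
from datetime import date


def package_age_days(released: date, today: date) -> int:
    """Number of days between a package's release date and today."""
    return (today - released).days


def stale_packages(
    release_dates: dict[str, date], today: date, max_age_days: int = 365
) -> list[str]:
    """Return the names of packages older than max_age_days, sorted by name."""
    return sorted(
        name
        for name, released in release_dates.items()
        if package_age_days(released, today) > max_age_days
    )


# Hypothetical inventory; real dates would come from a package registry or lockfile.
RELEASES = {"pkg_a": date(2024, 1, 1), "pkg_b": date(2026, 1, 1)}
stale = stale_packages(RELEASES, today=date(2026, 3, 13))
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the calculation reproducible in tests and reports.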

8. Runbook

Purpose: Operational life-support for production systems.

Why It Matters: When production is down, teams need clear instructions. In GitOps, the runbook is the "desired state" declared in Git, and the system is continuously reconciled toward it.

Key Elements:

  • Deployment Steps: How to deploy new versions
  • Scaling Triggers: When and how to scale up/down
  • Backup/Restore Procedures: Disaster recovery steps
  • "3:00 AM" Troubleshooting: Quick fixes for common failures

Best Practices:

  • Store in K8s manifests (Flux/Argo) for GitOps
  • Automate as much as possible
  • Test runbook procedures regularly
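
MTTR, the reliability measure named in the matrix, is the average time from detection to recovery across incidents. A minimal sketch follows, assuming each incident is recorded as a (detected, resolved) pair of timestamps. The function name and the record shape are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents), timedelta()
    )
    return total / len(incidents)
```

For example, two incidents that took 30 and 90 minutes to resolve give an MTTR of 60 minutes.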

How to Use This Framework

  1. Start with Requirements - Define the business problem and success criteria
  2. Create the Spec - Translate requirements into machine-readable contracts
  3. Design Architecture - Visualize how the system will work
  4. Write Walkthrough - Document the logic and trade-offs
  5. Implement - Build the actual code
  6. Set up Validation - Add automated tests and gates
  7. Document Maintenance - Plan for long-term health
  8. Create Runbook - Define operational procedures

This framework ensures that every document serves a clear purpose and that your project remains maintainable, scalable, and aligned with business goals.