SDD + GitOps Documentation Framework
Overview
The SDD (Software Design Documentation) + GitOps Documentation Framework is a structured approach to software development documentation. It aligns technical work with business outcomes by giving each document type one clearly separated concern.
This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and can be measured against defined KPIs and SLOs.
The Documentation Matrix
| Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) |
|---|---|---|---|---|---|
| Requirements | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | KPI: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
| Spec | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | SLA/SLO: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A contract.yaml defining exactly how Julia sends Arrow data to Node.js. It forces user_id to be a UUID. |
| Architecture | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | Efficiency Metrics: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
| Walkthrough | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | Quality: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace:" 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
| Implementation | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | Code Health: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
| Validation | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | Compliance: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
| Maintenance | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | Sustainability: System Longevity. Measured by "Package Age," "Security Vulnerabilities Found," and "Migration Success Rate." | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
| Runbook | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | Reliability: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux-reconciled Kubernetes manifest that keeps 6 replicas of the Julia service running and has unhealthy pods restarted, for example when one hits 80% RAM. |
Detailed Breakdown of Each Document Type
1. Requirements
Purpose: Establish the Business North Star
The Requirements document is your anchor point. It answers the fundamental question: "What problem are we solving, and how do we know we've succeeded?"
Key Characteristics:
- Business-Focused: Written in business terms, not technical jargon
- Boundary-Setting: Explicitly defines what we will NOT build
- Outcome-Oriented: Focuses on user outcomes, not features
Best Practices:
- Include user stories that describe the user's perspective
- Document business constraints (regulatory, legal, compliance)
- Define competitive context and market positioning
- Establish clear success metrics from day one
Common Pitfalls to Avoid:
- Vague descriptions like "improve user experience"
- Changing requirements without updating the document
- Not defining what's out of scope
2. Spec (Specification)
Purpose: Create the Technical Contract
The Spec serves as the Single Source of Truth for all data interfaces. It's a machine-readable definition that ensures consistency across services.
Key Characteristics:
- Machine-Readable: Can be parsed by tools for validation and code generation
- Strictly Typed: Enforces data types and validation rules
- Comprehensive: Covers all endpoints, request/response formats, and error codes
Best Practices:
- Use OpenAPI/Swagger for REST APIs or Protobuf for gRPC
- Enforce consistent naming conventions (e.g., snake_case)
- Define validation rules for all data fields
- Document all possible error responses
Common Pitfalls to Avoid:
- Letting the spec diverge from the implementation
- Incomplete error handling documentation
- Not versioning the API spec
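As a rough illustration of what "strictly typed" and "consistent naming" mean in practice, the sketch below checks a payload against two rules taken from the matrix example: keys must be snake_case and `user_id` must be a UUID. The function name `validate_payload` and the sample payloads are hypothetical. A real project would more likely derive these checks from the OpenAPI or Protobuf definition than write them by hand.

```python
import re
import uuid

# snake_case: lowercase words separated by single underscores
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def validate_payload(payload: dict) -> list[str]:
    """Return a list of contract violations (empty means the payload conforms)."""
    errors = []
    for key in payload:
        if not SNAKE_CASE.match(key):
            errors.append(f"key '{key}' is not snake_case")
    try:
        # uuid.UUID raises ValueError for anything that is not a valid UUID string
        uuid.UUID(str(payload.get("user_id")))
    except ValueError:
        errors.append("user_id must be a UUID")
    return errors


# Hypothetical payloads, for illustration only
good = {"user_id": "123e4567-e89b-12d3-a456-426614174000", "report_id": 7}
bad = {"userId": "42"}
print(validate_payload(good))
print(validate_payload(bad))
```

Returning a list of violations instead of raising on the first one lets a caller report every problem in a single pass.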
3. Architecture
Purpose: Visualize the System Structure
The Architecture document provides a visual map of how components fit together. It helps identify bottlenecks and understand data flow.
Key Characteristics:
- Visual: Uses diagrams to represent complex relationships
- Comprehensive: Covers system context, data flow, and infrastructure
- Living Document: Updated as the system evolves
Best Practices:
- Use Mermaid.js for diagrams-as-code (versionable in Git)
- Include multiple views: System Context and other C4 model levels, ERDs, network topology
- Document trade-offs and architectural decisions
- Show data flow through the system
Common Pitfalls to Avoid:
- Over-engineering diagrams with unnecessary detail
- Not updating diagrams when the architecture changes
- Using static images instead of diagrams-as-code
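To show why diagrams-as-code are easy to version, here is a minimal sketch that renders the data path from the matrix example (Caddy → Node.js → NATS → Julia) as Mermaid flowchart text. The helper `to_mermaid` and the node ids `n0`, `n1`, ... are assumptions made for this example. In many teams the Mermaid source is simply written by hand and committed.

```python
def to_mermaid(path: list[str]) -> str:
    """Render an ordered list of components as a left-to-right Mermaid flowchart."""
    lines = ["flowchart LR"]
    # Pair each component with its successor to produce one edge per hop
    for i, (src, dst) in enumerate(zip(path, path[1:])):
        lines.append(f'    n{i}["{src}"] --> n{i + 1}["{dst}"]')
    return "\n".join(lines)


# Data path taken from the Architecture row of the Documentation Matrix
DATA_PATH = ["Caddy (Proxy)", "Node.js (API)", "NATS (Queue)", "Julia (Math Engine)"]
print(to_mermaid(DATA_PATH))
```

Because the output is plain text, a change to the data path shows up as an ordinary line diff in a pull request.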
4. Walkthrough
Purpose: Build Mental Models
The Walkthrough document explains the "why" behind the "how." It helps developers understand the rationale behind design decisions.
Key Characteristics:
- Narrative-Driven: Tells a story about how the system works
- Context-Rich: Explains trade-offs and decisions
- End-to-End: Traces flows from user input to system output
Best Practices:
- Document step-by-step traces of core features
- Explain architectural trade-offs and why each choice was made
- Include "The Big Picture" context
- Use real examples and data flows
Common Pitfalls to Avoid:
- Only documenting the happy path
- Assuming developers will figure out the "why" on their own instead of documenting the rationale behind decisions
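A Walkthrough can pair its narrative with a small runnable trace. The sketch below mimics the Claim-Check flow from the matrix example: the API stores the full payload elsewhere and puts only a small ticket on the queue, which is the stated rationale for avoiding memory spikes in NATS. The in-memory `blob_store` and `queue` are stand-ins, and all names here are hypothetical. This is not the NATS client API.

```python
import json
import uuid
from collections import deque

blob_store = {}   # stand-in for external storage holding full payloads
queue = deque()   # stand-in for the message queue; carries tickets only


def api_publish(payload: dict) -> str:
    """API step: store the payload, enqueue a small claim ticket."""
    claim_id = str(uuid.uuid4())
    blob_store[claim_id] = json.dumps(payload).encode()
    queue.append(json.dumps({"claim_id": claim_id}))
    return claim_id


def worker_consume() -> dict:
    """Worker step: read the ticket, then pull the full payload from storage."""
    ticket = json.loads(queue.popleft())
    return json.loads(blob_store.pop(ticket["claim_id"]))


api_publish({"rows": list(range(5))})
print(len(queue[0]), "bytes on the queue")  # ticket size, independent of payload size
print(worker_consume())
```

The message on the queue stays the same size no matter how large the payload grows, which is the trade-off a Walkthrough should spell out.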
5. Implementation
Purpose: The Functional Reality
The Implementation is the actual code that does the work. In SDD, the "boring" parts are auto-generated from the Spec to ensure consistency.
Key Characteristics:
- Machine-Generated: Types and routes auto-generated from Spec
- Human-Written: Business logic and helper functions
- Tested: Includes unit and integration tests
Best Practices:
- Auto-generate boring parts (types, routes) from the Spec
- Keep business logic separate from boilerplate
- Maintain comprehensive test coverage
- Document the local development setup
Common Pitfalls to Avoid:
- Hand-writing types that should be auto-generated
- Inconsistent code style
- Insufficient test coverage
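The idea of generating the "boring" parts from the Spec can be sketched as follows. The `SCHEMA` fragment, the type mapping, and `generate_dataclass` are all hypothetical and heavily simplified. In practice an existing OpenAPI or Protobuf code generator would do this work.

```python
# Hypothetical, simplified schema fragment standing in for part of the Spec
SCHEMA = {
    "name": "ReportRequest",
    "fields": {"user_id": "string", "row_count": "integer", "threshold": "number"},
}

# Mapping from spec type names to Python type names (an assumption for this sketch)
PY_TYPES = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}


def generate_dataclass(schema: dict) -> str:
    """Emit Python source for a dataclass that mirrors the schema's fields."""
    lines = [
        "from dataclasses import dataclass",
        "",
        "",
        "@dataclass",
        f"class {schema['name']}:",
    ]
    for field, spec_type in schema["fields"].items():
        lines.append(f"    {field}: {PY_TYPES[spec_type]}")
    return "\n".join(lines) + "\n"


print(generate_dataclass(SCHEMA))
```

Since the type is derived from the schema on each build, renaming a field in the Spec changes the generated code automatically and the two cannot drift apart unnoticed.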
6. Validation
Purpose: Enforce the Contract
The Validation layer provides automated gates that ensure the Implementation matches the Spec. It prevents human error from reaching production.
Key Characteristics:
- Automated: Runs on every commit/Pull Request
- Comprehensive: Covers contract tests, integration tests, and security scans
- Blocking: Prevents merges that violate the contract
Best Practices:
- Use contract testing tools (Dredd, Prism) to validate API contracts
- Run integration tests on every commit
- Include security scans in the CI pipeline
- Fail builds on contract violations
Common Pitfalls to Avoid:
- Not running tests on every commit
- Allowing manual overrides of validation gates
- Not updating tests when the Spec changes
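A blocking gate can be as simple as a script whose exit code fails the build. The sketch below compares a response against the fields the Spec declares and returns a non-zero code on any mismatch, such as a renamed key. The names `contract_violations` and `gate`, and the sample data, are assumptions for illustration. Tools like Dredd or Prism perform this check against the real OpenAPI document.

```python
def contract_violations(spec_fields: dict, response: dict) -> list[str]:
    """Compare a response against the declared field names and Python types."""
    violations = []
    for name, expected_type in spec_fields.items():
        if name not in response:
            violations.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            violations.append(f"wrong type for {name}: expected {expected_type.__name__}")
    for name in response:
        if name not in spec_fields:
            violations.append(f"undeclared field: {name}")
    return violations


def gate(spec_fields: dict, response: dict) -> int:
    """Return a process exit code: 0 lets the build pass, 1 blocks the merge."""
    problems = contract_violations(spec_fields, response)
    for problem in problems:
        print(f"CONTRACT VIOLATION: {problem}")
    return 1 if problems else 0


# Hypothetical spec and a response where a developer renamed user_id to userId
spec = {"user_id": str, "total": float}
exit_code = gate(spec, {"userId": "abc", "total": 1.5})
print("exit code:", exit_code)
```

In a CI job the script would end with `sys.exit(gate(...))`, so the pipeline marks the pull request as failed.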
7. Maintenance
Purpose: Ensure Long-Term Health
The Maintenance document defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software.
Key Characteristics:
- Procedural: Step-by-step instructions for common tasks
- Scheduled: Includes regular maintenance windows
- Documented: Tracks technical debt and migration history
Best Practices:
- Document dependency update schedules
- Create secret rotation procedures
- Track technical debt in a "Graveyard"
- Document migration history and rollback procedures
Common Pitfalls to Avoid:
- Ad-hoc upgrades without documentation
- Ignoring technical debt until it becomes critical
- Not testing upgrades in staging first
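A rotation or update schedule is easier to follow when something checks it. The sketch below flags items whose last rotation is older than a chosen limit. The secret names, the dates, and the 90-day limit are hypothetical examples, not values prescribed by the framework. The same check could be applied to dependency release dates to track "Package Age".

```python
from datetime import date


def overdue(last_done: dict[str, date], max_age_days: int, today: date) -> list[str]:
    """Return the names whose last rotation or update is older than max_age_days."""
    return sorted(
        name
        for name, when in last_done.items()
        if (today - when).days > max_age_days
    )


# Hypothetical secret names and rotation dates
secrets = {
    "db_password": date(2024, 1, 1),
    "nats_token": date(2024, 5, 1),
}
print(overdue(secrets, max_age_days=90, today=date(2024, 6, 1)))
```

Passing `today` as a parameter keeps the check deterministic, which makes it straightforward to test.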
8. Runbook
Purpose: Operational Life-Support
The Runbook provides instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure.
Key Characteristics:
- Action-Oriented: Step-by-step instructions for common operations
- Automated: Infrastructure as code defines the desired state
- Crisis-Ready: Includes "3:00 AM" troubleshooting guides
Best Practices:
- Document deployment procedures
- Define scaling triggers and procedures
- Include backup and restore procedures
- Create troubleshooting guides for common issues
Common Pitfalls to Avoid:
- Not documenting procedures for common issues
- Not testing runbook procedures
- Not versioning runbooks with the infrastructure
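The Runbook's headline metric, MTTR, is the mean of the recovery durations across incidents. A minimal sketch of that calculation follows. The incident timestamps are invented for illustration, and measuring each incident as `(detected, resolved)` is an assumption about how they are recorded.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean Time to Recovery: average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes
incidents = [
    (datetime(2024, 3, 1, 3, 0), datetime(2024, 3, 1, 3, 30)),
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 15, 30)),
]
print(mttr(incidents))  # prints 1:00:00
```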
How to Use This Approach Effectively
Phase 1: Foundation (Week 1-2)
- Create Requirements Document
  - Define the Business North Star
  - Establish success metrics
  - Define out-of-scope items
- Write the Spec
  - Define all data interfaces
  - Establish naming conventions
  - Document validation rules
- Design Architecture
  - Create system diagrams
  - Document data flow
  - Identify potential bottlenecks
Phase 2: Development (Week 3+)
- Write Walkthrough
  - Document end-to-end flows
  - Explain architectural trade-offs
  - Create mental models for developers
- Implement Code
  - Auto-generate boring parts from Spec
  - Write business logic
  - Implement tests
Phase 3: Quality Assurance
- Set Up Validation
  - Configure CI/CD pipeline
  - Set up contract testing
  - Configure security scans
- Create Runbook
  - Document deployment procedures
  - Define scaling triggers
  - Create troubleshooting guides
Phase 4: Maintenance
- Document Maintenance
  - Create dependency update schedule
  - Document secret rotation
  - Track technical debt
Key Principles for Success
- Separation of Concerns: Keep business concerns separate from technical concerns
- Machine-Readable Contracts: Use OpenAPI/Protobuf for specs to enable automation
- Automation: Automate boring parts and validation to reduce human error
- Measurability: Every document should have measurable outcomes
- Version Control: Keep all documentation in Git for history and collaboration
- Living Documents: Update documentation as the system evolves
- Audience-Focused: Write for the intended audience's needs and knowledge level
Conclusion
The SDD + GitOps Documentation Framework provides a comprehensive, structured approach to software development documentation. By following this framework, teams can ensure that:
- Business goals are clearly defined and measurable
- Technical contracts are machine-readable and enforced
- System architecture is visualized and understood
- Developers have clear mental models of the system
- Code quality is maintained through automation
- Operations are reliable and repeatable
This framework is about more than documentation: it creates a shared understanding across the entire team and keeps every decision aligned with business goals.