2026-03-13 09:47:10 +07:00
parent fbd061b253
commit 437ca81e76

@@ -2,187 +2,294 @@
## Overview
The **SDD + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns.
The **SDD (Software Design Documentation) + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns.
This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and can be measured for effectiveness. It's designed to prevent common pitfalls like feature creep, communication gaps, and operational fragility.
This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and is measurable through clear KPIs and SLOs.
---
## The Documentation Matrix
| Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) |
|----------|----------------------------------|----------|------------------|----------------------|------------------------|
| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | **KPI**: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | **SLA/SLO**: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. |
| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | **Efficiency Metrics**: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | **Quality**: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace": 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | **Code Health**: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | **Compliance**: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | **Sustainability**: System Longevity. Measured by "Package Age", "Security Vulnerabilities Found", and "Migration Success Rate". | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
| **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | **Reliability**: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |
|----------|---------------------------------|----------|------------------|----------------------|------------------------|
| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | KPI: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | SLA/SLO: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. |
| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | Efficiency Metrics: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | Quality: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace:" 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | Code Health: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | Compliance: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | Sustainability: System Longevity. Measured by "Package Age," "Security Vulnerabilities Found," and "Migration Success Rate." | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
| **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | Reliability: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |
---
## Detailed Document Descriptions
## Detailed Breakdown of Each Document Type
### 1. Requirements
**Purpose**: Establish the Business North Star.
**Purpose**: Establish the Business North Star
**Why It Matters**: Without clear requirements, teams drift into "feature creep": building things that don't solve the actual problem. This document anchors the project in business outcomes.
The Requirements document is your anchor point. It answers the fundamental question: "What problem are we solving, and how do we know we've succeeded?"
**Key Elements**:
- **User Stories**: What the user needs to accomplish
- **Business Constraints**: Budget, timeline, regulatory requirements
- **Competitive Context**: What competitors do and how you differentiate
- **Success Metrics**: Quantifiable goals that define "done"
**Key Characteristics**:
- **Business-Focused**: Written in business terms, not technical jargon
- **Boundary-Setting**: Explicitly defines what we will NOT build
- **Outcome-Oriented**: Focuses on user outcomes, not features
**Best Practices**:
- Keep it in a shared wiki (Notion, GitHub Wiki) for collaborative editing
- Focus on outcomes, not solutions
- Explicitly state what you will NOT build
- Include user stories that describe the user's perspective
- Document business constraints (regulatory, legal, compliance)
- Define competitive context and market positioning
- Establish clear success metrics from day one
**Common Pitfalls to Avoid**:
- Vague descriptions like "improve user experience"
- Changing requirements without updating the document
- Not defining what's out of scope
---
### 2. Spec (Specification)
**Purpose**: Create a machine-readable technical contract.
**Purpose**: Create the Technical Contract
**Why It Matters**: Communication gaps between services cause bugs. A strict, typed spec prevents these by being the Single Source of Truth.
The Spec serves as the Single Source of Truth for all data interfaces. It's a machine-readable definition that ensures consistency across services.
**Key Elements**:
- **API Endpoints**: All routes with HTTP methods
- **Data Types**: Strict typing with validation rules
- **Error Codes**: Comprehensive error response definitions
- **Naming Conventions**: snake_case keys, consistent patterns
**Key Characteristics**:
- **Machine-Readable**: Can be parsed by tools for validation and code generation
- **Strictly Typed**: Enforces data types and validation rules
- **Comprehensive**: Covers all endpoints, request/response formats, and error codes
**Best Practices**:
- Use OpenAPI (YAML/JSON) for REST APIs or Protobuf for gRPC
- Automate generation of client/server code from the spec
- Run contract tests against the spec in CI/CD
- Use OpenAPI/Swagger for REST APIs or Protobuf for gRPC
- Enforce consistent naming conventions (e.g., snake_case)
- Define validation rules for all data fields
- Document all possible error responses
**Common Pitfalls to Avoid**:
- Letting the spec diverge from the implementation
- Incomplete error handling documentation
- Not versioning the API spec
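
As a rough illustration of what "strictly typed" can mean in practice, the sketch below checks a payload against a small field spec. The `CONTRACT` dictionary, the `report_total` and `status` fields, and the `validate_payload` helper are assumptions made for this example; only `user_id` being a UUID comes from the matrix above. A real setup would more likely load the rules from the OpenAPI or Protobuf file itself.

```python
import uuid

# Hypothetical field spec standing in for a fragment of contract.yaml.
CONTRACT = {
    "user_id": "uuid",
    "report_total": "number",
    "status": "string",
}


def validate_payload(payload: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of human-readable contract violations (empty if valid)."""
    errors = []
    for field, kind in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
            continue
        value = payload[field]
        if kind == "uuid":
            try:
                # uuid.UUID accepts several textual forms (with or without hyphens).
                uuid.UUID(str(value))
            except ValueError:
                errors.append(f"{field} is not a valid UUID")
        elif kind == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{field} must be a number")
        elif kind == "string":
            if not isinstance(value, str):
                errors.append(f"{field} must be a string")
    # Fields the contract does not know about are also violations.
    for field in payload:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors
```

Returning a list of violations rather than raising on the first one lets a caller report every problem in a single pass.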
---
### 3. Architecture
**Purpose**: Visualize the system structure and data flow.
**Purpose**: Visualize the System Structure
**Why It Matters**: Complex systems (like your 6-node cluster) need clear maps. Without them, teams can't identify bottlenecks or make informed decisions.
The Architecture document provides a visual map of how components fit together. It helps identify bottlenecks and understand data flow.
**Key Elements**:
- **System Context Diagram**: Shows the system and its external dependencies
- **Database ERD**: Entity-Relationship diagrams for data model
- **Network Security Policies**: Firewall rules, service mesh configs
- **Infrastructure Maps**: Cloud resources, scaling groups
**Key Characteristics**:
- **Visual**: Uses diagrams to represent complex relationships
- **Comprehensive**: Covers system context, data flow, and infrastructure
- **Living Document**: Updated as the system evolves
**Best Practices**:
- Use Mermaid.js for diagrams-as-code (versionable, diffable)
- Update diagrams when architecture changes
- Focus on data flow and decision points
- Use Mermaid.js for diagrams-as-code (versionable in Git)
- Include multiple views: System Context, C4 model, ERDs, network topology
- Document trade-offs and architectural decisions
- Show data flow through the system
**Common Pitfalls to Avoid**:
- Over-engineering diagrams with unnecessary detail
- Not updating diagrams when the architecture changes
- Using static images instead of diagrams-as-code
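
To make "diagrams-as-code" concrete, here is a minimal sketch that renders the example data path from the matrix (Caddy, Node.js, NATS, Julia) as Mermaid flowchart source. The `DATA_PATH` list and the `to_mermaid` helper are hypothetical names for this example; many teams simply write the Mermaid text by hand.

```python
# Components and roles taken from the example data path in the matrix.
DATA_PATH = [
    ("Caddy", "Proxy"),
    ("Node.js", "API"),
    ("NATS", "Queue"),
    ("Julia", "Math Engine"),
]


def to_mermaid(path):
    """Render a linear data path as Mermaid flowchart source."""
    lines = ["flowchart LR"]
    # Declare one node per component, labelled "Name (Role)".
    for i, (name, role) in enumerate(path):
        lines.append(f'    n{i}["{name} ({role})"]')
    # Connect each node to the next one in the path.
    for i in range(len(path) - 1):
        lines.append(f"    n{i} --> n{i + 1}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(DATA_PATH))
```

Because the output is plain text, a change to the data path shows up as a readable diff in Git.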
---
### 4. Walkthrough
**Purpose**: Build a mental model through narrative.
**Purpose**: Build Mental Models
**Why It Matters**: Code doesn't explain *why*. Walkthroughs capture the reasoning behind architectural trade-offs, making onboarding faster and reducing conceptual bugs.
The Walkthrough document explains the "why" behind the "how." It helps developers understand the rationale behind design decisions.
**Key Elements**:
- **Step-by-step traces**: End-to-end flow of user actions
- **Trade-off explanations**: Why you chose option A over B
- **The Big Picture**: How components fit together conceptually
**Key Characteristics**:
- **Narrative-Driven**: Tells a story about how the system works
- **Context-Rich**: Explains trade-offs and decisions
- **End-to-End**: Traces flows from user input to system output
**Best Practices**:
- Write in a TOUR.md file or record Loom videos
- Focus on intuition, not just mechanics
- Include "Rationale" sections for each major decision
- Document step-by-step traces of core features
- Explain architectural trade-offs and why you chose them
- Include "The Big Picture" context
- Use real examples and data flows
**Common Pitfalls to Avoid**:
- Only documenting the happy path
- Assuming developers will figure out the "why"
- Not explaining the rationale behind decisions
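
The end-to-end trace in the matrix (UI sends JSON, API wraps it in a Claim-Check, Julia pulls it) can be sketched as below. This is only an illustration of the Claim-Check idea under stated assumptions: `ClaimCheckStore` is an in-memory stand-in for whatever blob store is actually used, a plain Python list stands in for the NATS queue, and the worker is written in Python rather than Julia.

```python
import json
import uuid


class ClaimCheckStore:
    """In-memory stand-in for a blob store (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        claim_id = str(uuid.uuid4())
        self._blobs[claim_id] = payload
        return claim_id

    def get(self, claim_id: str) -> bytes:
        return self._blobs[claim_id]


def api_publish(store, queue, request_body: dict) -> None:
    """Step 2: store the payload and enqueue only a small reference."""
    claim_id = store.put(json.dumps(request_body).encode("utf-8"))
    queue.append(json.dumps({"claim_id": claim_id}))


def worker_consume(store, queue) -> dict:
    """Step 3: read the reference, then pull the full payload from the store."""
    message = json.loads(queue.pop(0))
    return json.loads(store.get(message["claim_id"]))
```

The queue message stays a few dozen bytes regardless of payload size, which is the rationale the trace gives for avoiding memory spikes in the queue.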
---
### 5. Implementation
**Purpose**: The functional reality, meaning the actual code that runs.
**Purpose**: The Functional Reality
**Why It Matters**: This is what runs in production. In SDD, the spec-driven approach ensures boring parts are generated automatically, so developers focus on business logic.
The Implementation is the actual code that does the work. In SDD, the "boring" parts are auto-generated from the Spec to ensure consistency.
**Key Elements**:
- **Business Logic**: The unique value you provide
- **Unit Tests**: Covering edge cases and error paths
- **README.md**: Local environment setup instructions
**Key Characteristics**:
- **Machine-Generated**: Types and routes auto-generated from Spec
- **Human-Written**: Business logic and helper functions
- **Tested**: Includes unit and integration tests
**Best Practices**:
- Generate boilerplate (types, routes) from the Spec
- Maintain 90%+ test coverage
- Keep README.md up-to-date for local development
- Auto-generate boring parts (types, routes) from the Spec
- Keep business logic separate from boilerplate
- Maintain comprehensive test coverage
- Document the local development setup
**Common Pitfalls to Avoid**:
- Hand-writing types that should be auto-generated
- Inconsistent code style
- Insufficient test coverage
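
One way to picture "auto-generate types from the Spec" is the sketch below, which builds a frozen dataclass from a field spec at runtime. The `TYPE_MAP`, `build_type`, and `ReportRequest` names and the `report_total` field are assumptions for illustration; production setups usually rely on a dedicated code generator for their spec format instead.

```python
from dataclasses import fields, make_dataclass
from uuid import UUID

# Hypothetical mapping from spec type names to Python types.
TYPE_MAP = {"uuid": UUID, "number": float, "string": str}


def build_type(name: str, spec_fields: dict):
    """Create a frozen dataclass whose fields mirror the spec.

    Note: dataclasses record type annotations but do not enforce them at runtime.
    """
    return make_dataclass(
        name,
        [(field, TYPE_MAP[kind]) for field, kind in spec_fields.items()],
        frozen=True,
    )


# The type is derived from the spec, so hand-written code cannot drift from it.
ReportRequest = build_type(
    "ReportRequest", {"user_id": "uuid", "report_total": "number"}
)
```

Business logic can then import `ReportRequest` without anyone hand-writing its fields.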
---
### 6. Validation
**Purpose**: Automated quality gates.
**Purpose**: Enforce the Contract
**Why It Matters**: Human error happens. Validation layers catch mistakes before they reach production, preventing contract violations and security issues.
The Validation layer provides automated gates that ensure the Implementation matches the Spec. It prevents human error from reaching production.
**Key Elements**:
- **Contract Tests**: Verify implementation matches spec (Dredd, Prism)
- **Integration Tests**: Test service-to-service interactions
- **Security Scans**: SAST and dependency (SBOM) analysis on every PR
**Key Characteristics**:
- **Automated**: Runs on every commit/Pull Request
- **Comprehensive**: Covers contract tests, integration tests, and security scans
- **Blocking**: Prevents merges that violate the contract
**Best Practices**:
- Run validation on every pull request
- Block merges on contract violations
- Track build success rate as a KPI
- Use contract testing tools (Dredd, Prism) to validate API contracts
- Run integration tests on every commit
- Include security scans in the CI pipeline
- Fail builds on contract violations
**Common Pitfalls to Avoid**:
- Not running tests on every commit
- Allowing manual overrides of validation gates
- Not updating tests when the Spec changes
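
The matrix example of a CI job that blocks a Pull Request over a camelCase field could look roughly like this. The regular expression and the `find_naming_violations` and `gate` helpers are assumptions for this sketch, not the output of Dredd or Prism.

```python
import re

# Lowercase words separated by single underscores, e.g. user_id or created_at.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_naming_violations(field_names):
    """Return the names that are not snake_case, in their original order."""
    return [name for name in field_names if not SNAKE_CASE.fullmatch(name)]


def gate(field_names) -> int:
    """Return a process exit code: 0 to pass, 1 to block the merge."""
    violations = find_naming_violations(field_names)
    for name in violations:
        print(f"contract violation: '{name}' is not snake_case")
    return 1 if violations else 0
```

In a pipeline step, passing the result to `sys.exit` would make a non-zero code fail the job, which is what blocks the merge.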
---
### 7. Maintenance
**Purpose**: Guide for long-term health and evolution.
**Purpose**: Ensure Long-Term Health
**Why It Matters**: Software decays. Without a maintenance plan, dependency upgrades become risky, secrets go unrotated, and technical debt piles up.
The Maintenance document defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software.
**Key Elements**:
- **Dependency Update Schedule**: When and how to upgrade packages
- **Secret Rotation Steps**: How to rotate credentials securely
- **DB Migration Logs**: History of schema changes
- **Tech Debt "Graveyard"**: Documented technical debt with remediation plans
**Key Characteristics**:
- **Procedural**: Step-by-step instructions for common tasks
- **Scheduled**: Includes regular maintenance windows
- **Documented**: Tracks technical debt and migration history
**Best Practices**:
- Document the "how" for common maintenance tasks
- Track package age and security vulnerabilities
- Schedule regular tech debt reviews
- Document dependency update schedules
- Create secret rotation procedures
- Track technical debt in a "Graveyard"
- Document migration history and rollback procedures
**Common Pitfalls to Avoid**:
- Ad-hoc upgrades without documentation
- Ignoring technical debt until it becomes critical
- Not testing upgrades in staging first
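
The "Package Age" metric from the matrix can be computed with a few lines. In this sketch the `stale_packages` helper and the 365-day default threshold are assumptions chosen for illustration; release dates would normally come from a package registry or lockfile.

```python
from datetime import date


def stale_packages(release_dates: dict, today: date, max_age_days: int = 365):
    """Return {package: age_in_days} for packages older than max_age_days."""
    ages = {
        name: (today - released).days
        for name, released in release_dates.items()
    }
    return {name: age for name, age in ages.items() if age > max_age_days}
```

Passing `today` in explicitly keeps the function deterministic, so the same check can run in tests and in a scheduled maintenance job.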
---
### 8. Runbook
**Purpose**: Operational life-support for production systems.
**Purpose**: Operational Life-Support
**Why It Matters**: When production is down, teams need clear instructions. In GitOps, the runbook is the declared "desired state" that the cluster is continuously reconciled toward.
The Runbook provides instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure.
**Key Elements**:
- **Deployment Steps**: How to deploy new versions
- **Scaling Triggers**: When and how to scale up/down
- **Backup/Restore Procedures**: Disaster recovery steps
- **"3:00 AM" Troubleshooting**: Quick fixes for common failures
**Key Characteristics**:
- **Action-Oriented**: Step-by-step instructions for common operations
- **Automated**: Infrastructure as code defines the desired state
- **Crisis-Ready**: Includes "3:00 AM" troubleshooting guides
**Best Practices**:
- Store in K8s manifests (Flux/Argo) for GitOps
- Automate as much as possible
- Test runbook procedures regularly
- Document deployment procedures
- Define scaling triggers and procedures
- Include backup and restore procedures
- Create troubleshooting guides for common issues
**Common Pitfalls to Avoid**:
- Not documenting procedures for common issues
- Not testing runbook procedures
- Not versioning runbooks with the infrastructure
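
MTTR, the reliability metric named in the matrix, is the average time between detecting an incident and recovering from it. A minimal sketch, assuming incidents are available as `(detected_at, recovered_at)` pairs (the `mean_time_to_recovery` name is hypothetical):

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents) -> timedelta:
    """Average recovery duration over (detected_at, recovered_at) datetime pairs."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)
```

For example, incidents lasting 30 and 90 minutes give an MTTR of 60 minutes.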
---
## How to Use This Framework
## How to Use This Approach Effectively
1. **Start with Requirements** - Define the business problem and success criteria
2. **Create the Spec** - Translate requirements into machine-readable contracts
3. **Design Architecture** - Visualize how the system will work
4. **Write Walkthrough** - Document the logic and trade-offs
5. **Implement** - Build the actual code
6. **Set up Validation** - Add automated tests and gates
7. **Document Maintenance** - Plan for long-term health
8. **Create Runbook** - Define operational procedures
### Phase 1: Foundation (Week 1-2)
This framework ensures that every document serves a clear purpose and that your project remains maintainable, scalable, and aligned with business goals.
1. **Create Requirements Document**
   - Define the Business North Star
   - Establish success metrics
   - Define out-of-scope items
2. **Write the Spec**
   - Define all data interfaces
   - Establish naming conventions
   - Document validation rules
3. **Design Architecture**
   - Create system diagrams
   - Document data flow
   - Identify potential bottlenecks
### Phase 2: Development (Week 3+)
4. **Write Walkthrough**
   - Document end-to-end flows
   - Explain architectural trade-offs
   - Create mental models for developers
5. **Implement Code**
   - Auto-generate boring parts from Spec
   - Write business logic
   - Implement tests
### Phase 3: Quality Assurance
6. **Set Up Validation**
   - Configure CI/CD pipeline
   - Set up contract testing
   - Configure security scans
7. **Create Runbook**
   - Document deployment procedures
   - Define scaling triggers
   - Create troubleshooting guides
### Phase 4: Maintenance
8. **Document Maintenance**
   - Create dependency update schedule
   - Document secret rotation
   - Track technical debt
---
## Key Principles for Success
1. **Separation of Concerns**: Keep business concerns separate from technical concerns
2. **Machine-Readable Contracts**: Use OpenAPI/Protobuf for specs to enable automation
3. **Automation**: Automate boring parts and validation to reduce human error
4. **Measurability**: Every document should have measurable outcomes
5. **Version Control**: Keep all documentation in Git for history and collaboration
6. **Living Documents**: Update documentation as the system evolves
7. **Audience-Focused**: Write for the intended audience's needs and knowledge level
---
## Conclusion
The SDD + GitOps Documentation Framework provides a comprehensive, structured approach to software development documentation. By following this framework, teams can ensure that:
- Business goals are clearly defined and measurable
- Technical contracts are machine-readable and enforced
- System architecture is visualized and understood
- Developers have clear mental models of the system
- Code quality is maintained through automation
- Operations are reliable and repeatable
This framework is not just about documentation; it is about creating a shared understanding across the entire team and ensuring that every decision is aligned with business goals.