2026-03-13 09:47:10 +07:00
parent fbd061b253
commit 437ca81e76


@@ -2,187 +2,294 @@
## Overview

The **SDD (Software Design Documentation) + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns.

This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and is measurable through clear KPIs and SLOs.

---

## The Documentation Matrix
| Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) |
|----------|---------------------------------|----------|------------------|----------------------|------------------------|
| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | KPI: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | SLA/SLO: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. |
| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | Efficiency Metrics: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | Quality: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace:" 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | Code Health: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | Compliance: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | Sustainability: System Longevity. Measured by "Package Age," "Security Vulnerabilities Found," and "Migration Success Rate." | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
| **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | Reliability: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |
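To make the 99.9% uptime target in the Spec row concrete, the sketch below converts an availability SLO into a downtime budget. The 30-day window and the function name are assumptions for illustration; the matrix does not say over which period uptime is measured.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window.

    The 30-day default is an assumed measurement window, not one taken
    from the framework.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


budget = allowed_downtime_minutes(99.9)
print(f"{budget:.1f} minutes of downtime per 30 days")
```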
---

## Detailed Breakdown of Each Document Type

### 1. Requirements

**Purpose**: Establish the Business North Star

The Requirements document is your anchor point. It answers the fundamental question: "What problem are we solving, and how do we know we've succeeded?"

**Key Characteristics**:

- **Business-Focused**: Written in business terms, not technical jargon
- **Boundary-Setting**: Explicitly defines what we will NOT build
- **Outcome-Oriented**: Focuses on user outcomes, not features

**Best Practices**:

- Include user stories that describe the user's perspective
- Document business constraints (regulatory, legal, compliance)
- Define competitive context and market positioning
- Establish clear success metrics from day one

**Common Pitfalls to Avoid**:

- Vague descriptions like "improve user experience"
- Changing requirements without updating the document
- Not defining what's out of scope
---

### 2. Spec (Specification)

**Purpose**: Create the Technical Contract

The Spec serves as the Single Source of Truth for all data interfaces. It's a machine-readable definition that ensures consistency across services.

**Key Characteristics**:

- **Machine-Readable**: Can be parsed by tools for validation and code generation
- **Strictly Typed**: Enforces data types and validation rules
- **Comprehensive**: Covers all endpoints, request/response formats, and error codes

**Best Practices**:

- Use OpenAPI/Swagger for REST APIs or Protobuf for gRPC
- Enforce consistent naming conventions (e.g., snake_case)
- Define validation rules for all data fields
- Document all possible error responses

**Common Pitfalls to Avoid**:

- Letting the spec diverge from the implementation
- Incomplete error handling documentation
- Not versioning the API spec
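As a rough illustration of what "strictly typed" validation means in practice, the sketch below checks a payload against a tiny hand-written contract. The `CONTRACT` mapping and the `row_count` field are hypothetical; only the rule that `user_id` must be a UUID comes from the matrix. A real project would more likely derive these checks from the OpenAPI or Protobuf file than write them by hand.

```python
import uuid


def is_uuid(value) -> bool:
    """True if the value parses as a UUID string."""
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


# Hypothetical contract fragment: field name -> validation rule.
CONTRACT = {
    "user_id": is_uuid,
    "row_count": lambda v: isinstance(v, int) and not isinstance(v, bool) and v >= 0,
}


def validate(payload: dict) -> list[str]:
    """Return contract violations; an empty list means the payload conforms."""
    errors = []
    for field, check in CONTRACT.items():
        if field not in payload:
            errors.append(f"{field}: missing")
        elif not check(payload[field]):
            errors.append(f"{field}: invalid")
    # Fields the contract does not know about are rejected as well.
    for field in payload:
        if field not in CONTRACT:
            errors.append(f"{field}: not in contract")
    return errors
```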
---

### 3. Architecture

**Purpose**: Visualize the System Structure

The Architecture document provides a visual map of how components fit together. It helps identify bottlenecks and understand data flow.

**Key Characteristics**:

- **Visual**: Uses diagrams to represent complex relationships
- **Comprehensive**: Covers system context, data flow, and infrastructure
- **Living Document**: Updated as the system evolves

**Best Practices**:

- Use Mermaid.js for diagrams-as-code (versionable in Git)
- Include multiple views: System Context, C4 model, ERDs, network topology
- Document trade-offs and architectural decisions
- Show data flow through the system

**Common Pitfalls to Avoid**:

- Over-engineering diagrams with unnecessary detail
- Not updating diagrams when the architecture changes
- Using static images instead of diagrams-as-code
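One way to keep a diagram versionable is to generate its Mermaid source from a plain list of connections. The sketch below does this for the Caddy → Node.js → NATS → Julia data path from the matrix; the helper name `to_mermaid` and the edge-list representation are assumptions made for illustration.

```python
def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render a left-to-right Mermaid flowchart from (source, target) label pairs."""

    def node_id(label: str) -> str:
        # Keep ids alphanumeric so labels containing dots or parentheses
        # (for example "Node.js (API)") stay inside the quoted label only.
        return "".join(ch for ch in label if ch.isalnum())

    lines = ["flowchart LR"]
    for source, target in edges:
        lines.append(
            f'    {node_id(source)}["{source}"] --> {node_id(target)}["{target}"]'
        )
    return "\n".join(lines)


DATA_PATH = [
    ("Caddy (Proxy)", "Node.js (API)"),
    ("Node.js (API)", "NATS (Queue)"),
    ("NATS (Queue)", "Julia (Math Engine)"),
]
diagram = to_mermaid(DATA_PATH)
print(diagram)
```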
---

### 4. Walkthrough

**Purpose**: Build Mental Models

The Walkthrough document explains the "why" behind the "how." It helps developers understand the rationale behind design decisions.

**Key Characteristics**:

- **Narrative-Driven**: Tells a story about how the system works
- **Context-Rich**: Explains trade-offs and decisions
- **End-to-End**: Traces flows from user input to system output

**Best Practices**:

- Document step-by-step traces of core features
- Explain architectural trade-offs and why you chose them
- Include "The Big Picture" context
- Use real examples and data flows

**Common Pitfalls to Avoid**:

- Only documenting the happy path
- Assuming developers will figure out the "why"
- Not explaining the rationale behind decisions
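A walkthrough can pair its narrative with a small executable trace. The sketch below mirrors the three-step Claim-Check example from the matrix, with a dictionary standing in for shared storage and a list standing in for the NATS queue. Both stand-ins and the function names are assumptions; the framework does not say where the payload is actually stored.

```python
import json
import uuid

object_store: dict[str, bytes] = {}  # stand-in for shared storage (assumed)
queue: list[str] = []                # stand-in for a NATS subject (assumed)


def api_publish(payload: dict) -> str:
    """Step 2: store the full payload and enqueue only a small claim ticket."""
    claim_id = str(uuid.uuid4())
    object_store[claim_id] = json.dumps(payload).encode("utf-8")
    queue.append(json.dumps({"claim_id": claim_id}))
    return claim_id


def worker_consume() -> dict:
    """Step 3: take the ticket off the queue and pull the payload it points to."""
    ticket = json.loads(queue.pop(0))
    return json.loads(object_store.pop(ticket["claim_id"]))


# Step 1: the UI sends JSON.
request = {"user_id": "123e4567-e89b-12d3-a456-426614174000", "values": list(range(1000))}
api_publish(request)
queued_chars = len(queue[0])  # size of the message that crosses the queue
result = worker_consume()
# Rationale: the queue only ever carries the ticket, never the large payload.
```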
---

### 5. Implementation

**Purpose**: The Functional Reality

The Implementation is the actual code that does the work. In SDD, the "boring" parts are auto-generated from the Spec to ensure consistency.

**Key Characteristics**:

- **Machine-Generated**: Types and routes auto-generated from Spec
- **Human-Written**: Business logic and helper functions
- **Tested**: Includes unit and integration tests

**Best Practices**:

- Auto-generate boring parts (types, routes) from the Spec
- Keep business logic separate from boilerplate
- Maintain comprehensive test coverage
- Document the local development setup

**Common Pitfalls to Avoid**:

- Hand-writing types that should be auto-generated
- Inconsistent code style
- Insufficient test coverage
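The idea of generating types from the Spec can be sketched with the standard library alone. Here a hypothetical, already-parsed spec fragment (`SPEC`, with an invented `ReportRequest` type) is turned into dataclasses using `dataclasses.make_dataclass`. A real pipeline would more likely run an OpenAPI or Protobuf code generator; this only shows the principle that the type definitions are derived, not hand-written.

```python
import uuid
from dataclasses import fields, make_dataclass

# Hypothetical parsed spec: type name -> {field name: Python type}.
SPEC = {
    "ReportRequest": {"user_id": uuid.UUID, "row_count": int},
}


def generate_types(spec: dict) -> dict:
    """Build one dataclass per spec entry so field names always match the spec."""
    return {
        name: make_dataclass(name, list(props.items()))
        for name, props in spec.items()
    }


ReportRequest = generate_types(SPEC)["ReportRequest"]
req = ReportRequest(
    user_id=uuid.UUID("123e4567-e89b-12d3-a456-426614174000"),
    row_count=10,
)
# Note: dataclasses record the declared types but do not enforce them at runtime.
field_names = [f.name for f in fields(ReportRequest)]
```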
---

### 6. Validation

**Purpose**: Enforce the Contract

The Validation layer provides automated gates that ensure the Implementation matches the Spec. It prevents human error from reaching production.

**Key Characteristics**:

- **Automated**: Runs on every commit/Pull Request
- **Comprehensive**: Covers contract tests, integration tests, and security scans
- **Blocking**: Prevents merges that violate the contract

**Best Practices**:

- Use contract testing tools (Dredd, Prism) to validate API contracts
- Run integration tests on every commit
- Include security scans in the CI pipeline
- Fail builds on contract violations

**Common Pitfalls to Avoid**:

- Not running tests on every commit
- Allowing manual overrides of validation gates
- Not updating tests when the Spec changes
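A blocking gate can be as small as a script whose exit code decides whether the pipeline continues. The sketch below flags field names that are not snake_case, in the spirit of the camelCase example in the matrix. The schema shape, the function names, and the regular expression are assumptions; it is not a substitute for contract-testing tools such as Dredd or Prism.

```python
import re
import sys

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_violations(schema: dict, path: str = "") -> list[str]:
    """Recursively collect dotted paths of field names that are not snake_case."""
    violations = []
    for name, child in schema.items():
        full = f"{path}.{name}" if path else name
        if not SNAKE_CASE.match(name):
            violations.append(full)
        if isinstance(child, dict):
            violations.extend(find_violations(child, full))
    return violations


def gate(schema: dict) -> int:
    """Return a process exit code: 0 lets the merge proceed, 1 blocks it."""
    violations = find_violations(schema)
    for field in violations:
        print(f"contract violation: {field} is not snake_case", file=sys.stderr)
    return 1 if violations else 0


# Hypothetical schema with one camelCase field slipped in.
schema = {"user_id": "uuid", "report": {"createdAt": "datetime", "row_count": "integer"}}
exit_code = gate(schema)  # a CI job would pass this value to sys.exit()
```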
---

### 7. Maintenance

**Purpose**: Ensure Long-Term Health

The Maintenance document defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software.

**Key Characteristics**:

- **Procedural**: Step-by-step instructions for common tasks
- **Scheduled**: Includes regular maintenance windows
- **Documented**: Tracks technical debt and migration history

**Best Practices**:

- Document dependency update schedules
- Create secret rotation procedures
- Track technical debt in a "Graveyard"
- Document migration history and rollback procedures

**Common Pitfalls to Avoid**:

- Ad-hoc upgrades without documentation
- Ignoring technical debt until it becomes critical
- Not testing upgrades in staging first
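The "Package Age" measurement from the matrix could be computed along these lines. The package names, release dates, and the 365-day threshold are all invented for illustration; the framework does not define what counts as stale.

```python
from datetime import date


def package_age_days(released: date, today: date) -> int:
    """Days elapsed since the installed version was released."""
    return (today - released).days


def stale_packages(release_dates: dict, today: date, max_age_days: int = 365) -> list:
    """Names of packages whose installed version is older than the threshold."""
    return sorted(
        name
        for name, released in release_dates.items()
        if package_age_days(released, today) > max_age_days
    )


# Hypothetical inventory: package name -> release date of the installed version.
inventory = {
    "http-client": date(2024, 1, 15),
    "arrow-bindings": date(2025, 11, 1),
}
stale = stale_packages(inventory, today=date(2026, 3, 13))
```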
---

### 8. Runbook

**Purpose**: Operational Life-Support

The Runbook provides instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure.

**Key Characteristics**:

- **Action-Oriented**: Step-by-step instructions for common operations
- **Automated**: Infrastructure as code defines the desired state
- **Crisis-Ready**: Includes "3:00 AM" troubleshooting guides

**Best Practices**:

- Document deployment procedures
- Define scaling triggers and procedures
- Include backup and restore procedures
- Create troubleshooting guides for common issues

**Common Pitfalls to Avoid**:

- Not documenting procedures for common issues
- Not testing runbook procedures
- Not versioning runbooks with the infrastructure
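MTTR, the reliability measure named in the matrix, is the average time between detecting an incident and recovering from it. A minimal sketch, assuming incidents are recorded as (detected, recovered) timestamp pairs; the sample incidents are invented.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (recovered - detected) across all incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: outages of 20, 50, and 5 minutes.
incidents = [
    (datetime(2026, 3, 1, 3, 0), datetime(2026, 3, 1, 3, 20)),
    (datetime(2026, 3, 5, 14, 0), datetime(2026, 3, 5, 14, 50)),
    (datetime(2026, 3, 9, 9, 0), datetime(2026, 3, 9, 9, 5)),
]
mttr = mean_time_to_recovery(incidents)
```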
---

## How to Use This Approach Effectively

### Phase 1: Foundation (Week 1-2)

1. **Create Requirements Document**
   - Define the Business North Star
   - Establish success metrics
   - Define out-of-scope items
2. **Write the Spec**
   - Define all data interfaces
   - Establish naming conventions
   - Document validation rules
3. **Design Architecture**
   - Create system diagrams
   - Document data flow
   - Identify potential bottlenecks

### Phase 2: Development (Week 3+)

4. **Write Walkthrough**
   - Document end-to-end flows
   - Explain architectural trade-offs
   - Create mental models for developers
5. **Implement Code**
   - Auto-generate boring parts from Spec
   - Write business logic
   - Implement tests

### Phase 3: Quality Assurance

6. **Set Up Validation**
   - Configure CI/CD pipeline
   - Set up contract testing
   - Configure security scans
7. **Create Runbook**
   - Document deployment procedures
   - Define scaling triggers
   - Create troubleshooting guides

### Phase 4: Maintenance

8. **Document Maintenance**
   - Create dependency update schedule
   - Document secret rotation
   - Track technical debt
---
## Key Principles for Success
1. **Separation of Concerns**: Keep business concerns separate from technical concerns
2. **Machine-Readable Contracts**: Use OpenAPI/Protobuf for specs to enable automation
3. **Automation**: Automate boring parts and validation to reduce human error
4. **Measurability**: Every document should have measurable outcomes
5. **Version Control**: Keep all documentation in Git for history and collaboration
6. **Living Documents**: Update documentation as the system evolves
7. **Audience-Focused**: Write for the intended audience's needs and knowledge level
---
## Conclusion
The SDD + GitOps Documentation Framework provides a comprehensive, structured approach to software development documentation. By following this framework, teams can ensure that:
- Business goals are clearly defined and measurable
- Technical contracts are machine-readable and enforced
- System architecture is visualized and understood
- Developers have clear mental models of the system
- Code quality is maintained through automation
- Operations are reliable and repeatable

This framework is not just about documentation; it's about creating a shared understanding across the entire team and ensuring that every decision is aligned with business goals.