# Design Document: One-Click CloudFormation Stack Updater

## Overview

The One-Click CFN Stack Updater is a Python CLI tool that automates the rolling update of all `CL-AppPipe-*` CloudFormation stacks in an audit account. It discovers stacks dynamically by prefix, validates IAM permissions, and updates each stack with its existing parameters, so the only change is that the nested template URL is re-resolved to the latest version. The tool supports concurrency control, dry-run mode, and exponential backoff on throttling, and it produces a structured summary report.

The tool is implemented as a single Python package using `boto3` for AWS interactions. Python is chosen because it is widely used for AWS automation tooling, `boto3` provides first-class CloudFormation support, and the operator audience is already familiar with Python-based AWS scripts.

## Architecture

The system follows a pipeline architecture with four sequential phases:

```mermaid
flowchart LR
    A[Permission\nValidation] --> B[Stack\nDiscovery]
    B --> C[Stack\nUpdate Engine]
    C --> D[Report\nGenerator]
```

1. **Permission Validation**: Verifies the executing role has the required IAM permissions before any work begins.
2. **Stack Discovery**: Lists all CloudFormation stacks matching the `CL-AppPipe-` prefix and filters to updatable states.
3. **Stack Update Engine**: Updates stacks concurrently (bounded by `Concurrency_Limit`) with retry logic for throttling errors.
4. **Report Generator**: Aggregates results and produces the final summary.

### Concurrency Model

The update engine uses a semaphore-based concurrency model with `asyncio` to run up to `Concurrency_Limit` stack updates in parallel. Each update is an independent coroutine that:

1. Fetches current stack parameters
2. Calls `UpdateStack` with existing parameters and the nested template URL
3. Polls `DescribeStacks` until the update completes or fails
4. Records the result

```mermaid
flowchart TD
    S[Semaphore: Concurrency_Limit] --> U1[Update Stack 1]
    S --> U2[Update Stack 2]
    S --> U3[Update Stack N]
    U1 --> R[Result Collector]
    U2 --> R
    U3 --> R
    R --> Report[Summary Report]
```

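The semaphore pattern described above can be sketched as follows. This is a minimal illustration, not the tool's actual engine: `run_bounded` and its `update_one` callback are hypothetical names, and `update_one` is assumed to turn its own failures into result values so that one failing stack does not cancel the others.

```python
import asyncio


async def run_bounded(names, update_one, concurrency_limit):
    """Run update_one(name) for every name, at most concurrency_limit at a time."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def guarded(name):
        # Each coroutine waits here until one of the slots is free.
        async with semaphore:
            return await update_one(name)

    # gather() preserves input order, so results line up with names.
    return await asyncio.gather(*(guarded(name) for name in names))
```

A caller would drive it with something like `asyncio.run(run_bounded(stack_names, update_one, 5))`. Creating the semaphore inside the coroutine keeps it tied to the running event loop.
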
## Components and Interfaces

### CLI Entry Point (`cli.py`)

Parses command-line arguments and orchestrates the pipeline.

```python
def main(
    prefix: str = "CL-AppPipe-",
    concurrency: int = 5,
    dry_run: bool = False,
    region: str | None = None,
) -> int:
    """
    Entry point. Returns 0 on full success, 1 if any stack failed,
    and 2 if permission validation failed.
    """
```

**Arguments:**

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--prefix` | `str` | `CL-AppPipe-` | Stack name prefix to match |
| `--concurrency` | `int` | `5` | Max parallel updates |
| `--dry-run` | `bool` | `False` | Preview mode, no updates |
| `--region` | `str` | SDK default | AWS region override |

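One way the flags in the table could be wired up with the standard-library `argparse` module is sketched below. The `build_parser` and `positive_int` helpers and the `cfn-stack-updater` program name are assumptions for illustration; rejecting non-positive concurrency values is likewise an assumption, motivated by the concurrency limit being described as positive.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type: accept only integers >= 1 (hypothetical helper)."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return value


def build_parser() -> argparse.ArgumentParser:
    # The program name is a placeholder, not taken from the design.
    parser = argparse.ArgumentParser(prog="cfn-stack-updater")
    parser.add_argument("--prefix", default="CL-AppPipe-",
                        help="Stack name prefix to match")
    parser.add_argument("--concurrency", type=positive_int, default=5,
                        help="Max parallel updates")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview mode, no updates")
    parser.add_argument("--region", default=None,
                        help="AWS region override")
    return parser
```

Parsing `["--concurrency", "3", "--dry-run"]` yields `concurrency=3`, `dry_run=True`, and the defaults for the other flags. Note that `argparse` itself exits with status 2 on a parse error, which coincides with the permission-failure code in the Exit Codes table.
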
### Permission Validator (`permissions.py`)

```python
def validate_permissions(cfn_client) -> list[str]:
    """
    Checks required IAM permissions by performing read-only probe API calls.
    Returns a list of missing permission names. Empty list means all OK.
    """
```

Validates by attempting:

- `cloudformation:ListStacks`: calls `list_stacks` with a narrow filter
- `cloudformation:DescribeStacks`: calls `describe_stacks` with a non-existent stack name (expects a specific error)
- `cloudformation:UpdateStack`: validated implicitly during updates; the pre-check uses IAM policy simulation via `iam:SimulatePrincipalPolicy` if available, otherwise the check is deferred

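The probing approach can be sketched as below. This is an illustration under stated assumptions: `probe_permission`, `missing_permissions`, and the `ACCESS_DENIED_CODES` set are hypothetical, and the sketch inspects the `response` dict that botocore's `ClientError` carries instead of importing botocore, so that it stays dependency-free.

```python
from typing import Callable, Dict, List, Optional

# Assumed set of error codes that signal a denied request.
ACCESS_DENIED_CODES = frozenset({"AccessDenied", "AccessDeniedException"})


def probe_permission(action: str, call: Callable[[], object]) -> Optional[str]:
    """Run one probe call; return `action` if it was denied, else None."""
    try:
        call()
    except Exception as exc:
        response = getattr(exc, "response", None)
        if not isinstance(response, dict):
            raise  # not an API error (for example, a connection problem)
        code = response.get("Error", {}).get("Code", "")
        if code in ACCESS_DENIED_CODES:
            return action
        # Any other API error (such as a validation error for a
        # non-existent stack) means the request itself was authorized.
    return None


def missing_permissions(probes: Dict[str, Callable[[], object]]) -> List[str]:
    """Return the actions whose probe call was denied, in probe order."""
    return [action for action, call in probes.items()
            if probe_permission(action, call) is not None]
```

A caller would pass a mapping such as `{"cloudformation:ListStacks": lambda: cfn_client.list_stacks(...)}`, with the arguments chosen as described in the list above.
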
### Stack Discovery (`discovery.py`)

```python
from dataclasses import dataclass


@dataclass
class DiscoveredStack:
    name: str
    status: str
    updatable: bool


def discover_stacks(cfn_client, prefix: str) -> list[DiscoveredStack]:
    """
    Lists all stacks matching the prefix. Paginates through all results.
    Marks each stack as updatable or not based on its status.
    """
```

**Non-updatable statuses:**

- `ROLLBACK_COMPLETE`
- `ROLLBACK_IN_PROGRESS`
- `UPDATE_ROLLBACK_IN_PROGRESS`
- `UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS`
- `DELETE_IN_PROGRESS`
- `DELETE_COMPLETE`

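The filtering and classification step can be sketched as a pure function over already-fetched stack summaries. This is a sketch, not the tool's code: `classify_stacks` is a hypothetical helper, and pagination is left out. The `StackName` and `StackStatus` keys follow the shape of the entries returned by the CloudFormation `ListStacks` API.

```python
from dataclasses import dataclass

NON_UPDATABLE_STATUSES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
})


@dataclass
class DiscoveredStack:
    name: str
    status: str
    updatable: bool


def classify_stacks(summaries, prefix):
    """Keep prefix-matched stacks and flag each one as updatable or not."""
    return [
        DiscoveredStack(
            name=summary["StackName"],
            status=summary["StackStatus"],
            updatable=summary["StackStatus"] not in NON_UPDATABLE_STATUSES,
        )
        for summary in summaries
        if summary["StackName"].startswith(prefix)
    ]
```

Keeping this logic separate from the API calls makes it straightforward to exercise with generated stack names and statuses.
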
### Stack Updater (`updater.py`)

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class StackUpdateResult:
    stack_name: str
    status: Literal["succeeded", "failed", "skipped", "no-update-needed"]
    error: str | None = None
    duration_seconds: float = 0.0


async def update_stack(
    cfn_client,
    stack_name: str,
    template_url: str,
    max_retries: int = 3,
) -> StackUpdateResult:
    """
    Updates a single stack. Handles 'No updates' response, throttling retries,
    and non-updatable state detection.
    """


async def update_all_stacks(
    cfn_client,
    stacks: list[DiscoveredStack],
    template_url: str,
    concurrency: int = 5,
    max_retries: int = 3,
) -> list[StackUpdateResult]:
    """
    Updates all stacks with bounded concurrency using asyncio.Semaphore.
    """
```

**Retry logic:**

- Triggered on `Throttling` or `RequestLimitExceeded` error codes
- Exponential backoff: `base_delay * 2^attempt`, with `attempt` counted from 0 and `base_delay` = 1s
- Maximum 3 retries per stack

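To make the backoff formula concrete, the short sketch below computes the delay before each retry. `backoff_delays` is a hypothetical helper used only for illustration.

```python
def backoff_delays(max_retries: int = 3, base_delay: float = 1.0) -> list:
    """Delay in seconds before each retry, with attempts counted from 0."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


print(backoff_delays())  # [1.0, 2.0, 4.0]
```

With the defaults, a stack that is throttled on every call waits 1 s, 2 s, and 4 s (7 s in total) before it is marked as failed.
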
### Report Generator (`report.py`)

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UpdateRunReport:
    start_time: datetime
    end_time: datetime
    total_found: int
    succeeded: int
    failed: int
    skipped: int
    no_update_needed: int
    results: list[StackUpdateResult]


def generate_report(
    results: list[StackUpdateResult],
    total_found: int,
    start_time: datetime,
    end_time: datetime,
) -> UpdateRunReport:
    """
    Aggregates results into a summary report.
    """


def format_report(report: UpdateRunReport) -> str:
    """
    Formats the report as a human-readable string for console output.
    """
```

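The aggregation step amounts to counting results by status. A minimal sketch, assuming a simplified stand-in `Result` class in place of `StackUpdateResult` and a hypothetical `count_statuses` helper:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Result:
    """Simplified stand-in for StackUpdateResult (illustration only)."""
    stack_name: str
    status: str


def count_statuses(results):
    """Count results per status; statuses that never occur count as 0."""
    counts = Counter(result.status for result in results)
    return {
        "succeeded": counts["succeeded"],
        "failed": counts["failed"],
        "skipped": counts["skipped"],
        # The status string uses hyphens; the report field uses underscores.
        "no_update_needed": counts["no-update-needed"],
    }
```

A `Counter` returns 0 for missing keys, so an empty result list produces all-zero counts without special handling.
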
## Data Models

### DiscoveredStack

| Field | Type | Description |
|-------|------|-------------|
| `name` | `str` | CloudFormation stack name |
| `status` | `str` | Current stack status (e.g., `CREATE_COMPLETE`) |
| `updatable` | `bool` | Whether the stack is in an updatable state |

### StackUpdateResult

| Field | Type | Description |
|-------|------|-------------|
| `stack_name` | `str` | Name of the stack |
| `status` | `Literal["succeeded", "failed", "skipped", "no-update-needed"]` | Outcome of the update attempt |
| `error` | `str \| None` | Error message if failed |
| `duration_seconds` | `float` | Time taken for this stack's update |

### UpdateRunReport

| Field | Type | Description |
|-------|------|-------------|
| `start_time` | `datetime` | When the Update_Run started |
| `end_time` | `datetime` | When the Update_Run ended |
| `total_found` | `int` | Total Target_Stacks discovered |
| `succeeded` | `int` | Count of successfully updated stacks |
| `failed` | `int` | Count of failed stacks |
| `skipped` | `int` | Count of skipped (non-updatable) stacks |
| `no_update_needed` | `int` | Count of stacks with no changes |
| `results` | `list[StackUpdateResult]` | Per-stack results |

### Configuration Constants

```python
TEMPLATE_URL = "https://s3.amazonaws.com/solutions-reference/centralized-logging-with-opensearch/latest/AppLogS3Buffer.template"
DEFAULT_PREFIX = "CL-AppPipe-"
DEFAULT_CONCURRENCY = 5
MAX_RETRIES = 3
BASE_RETRY_DELAY = 1.0  # seconds
NON_UPDATABLE_STATUSES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
})
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Discovery returns exactly prefix-matched stacks with correct count

*For any* list of CloudFormation stacks with arbitrary names, the discovery function should return exactly those stacks whose names start with the configured prefix, and the reported count should equal the length of that filtered list.

**Validates: Requirements 1.1, 1.3**

### Property 2: All updatable stacks are attempted

*For any* set of discovered stacks marked as updatable, the update engine should produce exactly one update result per updatable stack: no stack is silently dropped and no stack is attempted twice.

**Validates: Requirements 2.2**

### Property 3: Concurrency limit invariant

*For any* positive concurrency limit and any list of stacks, at no point during an Update_Run should the number of concurrently in-progress stack updates exceed the specified concurrency limit.

**Validates: Requirements 2.3, 3.3**

### Property 4: Update call preserves existing parameters and uses correct template URL

*For any* stack with any set of existing parameters, the UpdateStack API call should include exactly those same parameter keys with `UsePreviousValue=True`, and the `TemplateURL` argument should equal the configured `Nested_Template_URL`.

**Validates: Requirements 3.1, 3.2**

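Property 4 can be illustrated by a pure function that builds the keyword arguments for the `UpdateStack` call. This is a sketch under assumptions: `build_update_kwargs` is a hypothetical helper, and `existing_parameters` is assumed to be the `Parameters` list of a `DescribeStacks` result, where each entry carries a `ParameterKey`.

```python
def build_update_kwargs(stack_name, existing_parameters, template_url):
    """Build UpdateStack arguments that reuse every existing parameter value."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            # UsePreviousValue tells CloudFormation to keep the stored value,
            # so no parameter values need to be sent.
            {"ParameterKey": param["ParameterKey"], "UsePreviousValue": True}
            for param in existing_parameters
        ],
    }
```

Because the function has no side effects, a property test can compare the keys it emits against the generated input without touching the AWS API.
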
### Property 5: "No updates" response maps to no-update-needed status

*For any* stack where CloudFormation returns a "No updates are to be performed" error, the resulting `StackUpdateResult` should have status `no-update-needed` (not `failed`).

**Validates: Requirements 3.4**

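A minimal sketch of the check behind Property 5 follows. It assumes the error code and message have already been extracted from the API error; `is_no_update_error` is a hypothetical helper. CloudFormation reports this condition as a `ValidationError` whose message contains the text matched below.

```python
NO_UPDATES_MESSAGE = "No updates are to be performed"


def is_no_update_error(code: str, message: str) -> bool:
    """True when an UpdateStack error only means the stack is already current."""
    return code == "ValidationError" and NO_UPDATES_MESSAGE in message
```

Matching on both the code and the message keeps other validation errors, such as a malformed template URL, classified as failures.
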
### Property 6: Non-updatable stacks are skipped

*For any* stack whose CloudFormation status is in the set of non-updatable statuses (e.g., `ROLLBACK_COMPLETE`, `DELETE_IN_PROGRESS`), the result should have status `skipped` and no UpdateStack API call should be made for that stack.

**Validates: Requirements 4.2**

### Property 7: Fault isolation (failures do not block remaining stacks)

*For any* list of N updatable stacks where K of them fail (including after retry exhaustion), the update engine should still produce results for all N stacks, and the number of attempted updates should equal N.

**Validates: Requirements 4.1, 4.4**

### Property 8: Throttling triggers exponential backoff retries

*For any* stack that receives throttling errors, the system should retry up to `MAX_RETRIES` times, and the delay between the i-th and (i+1)-th attempt (with attempts indexed from 0) should be at least `BASE_RETRY_DELAY * 2^i` seconds.

**Validates: Requirements 4.3**

### Property 9: Report aggregation and exit code correctness

*For any* list of `StackUpdateResult` values, the generated report's `succeeded`, `failed`, `skipped`, and `no_update_needed` counts should equal the actual counts of each status in the input list. For a run that passes permission validation, the exit code should be non-zero if and only if `failed > 0`.

**Validates: Requirements 5.2, 5.3**

### Property 10: Dry-run performs no updates and lists all discovered stacks

*For any* set of discovered stacks, when dry-run mode is enabled, zero UpdateStack API calls should be made, and the output should contain the name and current status of every discovered stack.

**Validates: Requirements 6.2, 6.3**

### Property 11: Permission validation correctness

*For any* subset of required permissions that are missing, the permission validator should return exactly those missing permissions, and when any permissions are missing, the Update_Run should terminate without making any UpdateStack API calls.

**Validates: Requirements 7.1, 7.2**

## Error Handling

### Error Categories and Responses

| Error | Source | Response |
|-------|--------|----------|
| Missing IAM permissions | Permission validation phase | Report missing permissions, exit with non-zero code, no updates attempted |
| No stacks found | Discovery phase | Log warning, exit with code 0 (not an error) |
| Stack in non-updatable state | Update phase | Skip stack, log warning, record as `skipped` |
| "No updates are to be performed" | CloudFormation UpdateStack API | Treat as success, record as `no-update-needed` |
| Throttling / RequestLimitExceeded | CloudFormation API | Retry with exponential backoff (max 3 retries) |
| Throttling after max retries | CloudFormation API | Mark stack as `failed`, continue with remaining stacks |
| UpdateStack failure (other) | CloudFormation API | Log error details, mark as `failed`, continue with remaining stacks |
| Boto3 connection error | Network / SDK | Mark stack as `failed`, log error, continue |
| Invalid CLI arguments | Argument parsing | Print usage, exit with non-zero code |

### Retry Strategy

```python
import asyncio

from botocore.exceptions import ClientError


async def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Await func(), retrying throttling errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code in ("Throttling", "RequestLimitExceeded") and attempt < max_retries:
                # Delays grow as 1s, 2s, 4s with the default settings.
                delay = base_delay * (2 ** attempt)
                await asyncio.sleep(delay)
            else:
                # Non-throttling error, or retries exhausted: propagate.
                raise
```

### Exit Codes

| Code | Meaning |
|------|---------|
| `0` | All stacks updated successfully (or no stacks found, or dry-run) |
| `1` | One or more stacks failed to update |
| `2` | Permission validation failed |

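The table can be read as a small decision function. The sketch below is illustrative only: `exit_code` is a hypothetical helper, and it assumes permission failures take precedence because no updates are attempted in that case.

```python
def exit_code(missing_permissions, failed_count):
    """Map the outcome of a run to the process exit code."""
    if missing_permissions:
        return 2  # permission validation failed; no updates were attempted
    if failed_count > 0:
        return 1  # one or more stacks failed to update
    return 0      # full success, no stacks found, or dry-run
```

For example, a run with no missing permissions and two failed stacks maps to exit code 1.
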
## Testing Strategy

### Testing Framework

- **Unit tests**: `pytest`
- **Property-based tests**: `hypothesis` (a widely used property-based testing library for Python)
- **Mocking**: `unittest.mock` and `botocore.stub.Stubber` for AWS API mocking

### Property-Based Tests

Each correctness property from the design maps to a single property-based test. Every property test runs at least 100 examples, configured through Hypothesis `settings` (`max_examples`).

| Property | Test Description | Key Generators |
|----------|-----------------|----------------|
| P1 | Discovery prefix filtering | Random stack name lists (some with prefix, some without) |
| P2 | All updatable stacks attempted | Random lists of DiscoveredStack with mixed updatable flags |
| P3 | Concurrency limit invariant | Random concurrency values (1–20), random stack counts (1–50) |
| P4 | Parameter preservation and template URL | Random parameter key-value dicts |
| P5 | "No updates" status mapping | Random stacks with mocked "no updates" responses |
| P6 | Non-updatable stack skipping | Random stacks with statuses drawn from updatable and non-updatable sets |
| P7 | Fault isolation | Random stack lists with random failure injection |
| P8 | Exponential backoff retries | Random retry counts (0–3), verify delay sequence |
| P9 | Report aggregation | Random lists of StackUpdateResult with random statuses |
| P10 | Dry-run no-op | Random discovered stacks, verify zero update calls |
| P11 | Permission validation | Random subsets of required permissions marked as missing |

Each test must be tagged with a comment:

```python
# Feature: one-click-cfn-stack-updater, Property 9: Report aggregation and exit code correctness
```

### Unit Tests

Unit tests complement property tests by covering:

- **Specific examples**: Known stack names, known parameter sets, expected API responses
- **Edge cases**: Empty stack list (Req 1.4), concurrency of 1, all stacks failing, all stacks already up-to-date
- **Integration points**: CLI argument parsing, boto3 Stubber-based API interaction tests
- **Error conditions**: Malformed API responses, unexpected exception types

### Test Organization

```
tests/
├── test_discovery.py      # P1, P6 property tests + unit tests
├── test_updater.py        # P2, P3, P4, P5, P7, P8 property tests + unit tests
├── test_report.py         # P9 property tests + unit tests
├── test_dry_run.py        # P10 property tests + unit tests
├── test_permissions.py    # P11 property tests + unit tests
└── test_cli.py            # CLI argument parsing unit tests
```