Building a Proactive SSL Certificate Monitoring & Notification Platform
Overview
SSL certificate expirations are a common, preventable cause of production outages, compliance violations, and emergency escalations. Despite this, many enterprises still rely on manual checks, spreadsheets, or tribal knowledge to track certificate lifecycles.
To address this risk, I designed and implemented a fully automated SSL certificate monitoring and notification platform. It proactively detects expiration risk, enforces alert hygiene, and integrates with existing operational tooling without introducing new infrastructure or policy risk.
This system now runs unattended, providing deterministic, auditable, and executive-safe visibility into certificate health across third-party platforms.
Problem
Prior to this solution:
- SSL certificate checks were manual and inconsistent
- Detection often occurred late or after expiration
- Notifications were ad-hoc, duplicative, or noisy
- Ownership and escalation paths were unclear
- Certificate failures posed a real Sev 1–5 outage risk
This created unnecessary operational exposure and consumed engineer time with low-leverage work.
Goals
The solution needed to:
- Detect certificate expiration risk before it caused incidents
- Eliminate manual checks and reminders
- Enforce consistent, professional notifications
- Avoid alert fatigue and duplicate messaging
- Integrate with approved enterprise tooling only
- Remain deterministic, auditable, and policy-compliant
Constraints & Design Principles
Key constraints shaped the architecture:
- No reliance on cloud-native monitoring platforms
- No AI making runtime decisions or sending emails
- Must operate within existing enterprise security boundaries
- Must be inspectable and safe to test without sending emails
Design principles:
- Deterministic logic over probabilistic AI
- Queue-based notification delivery
- Strong separation of detection, decisioning, and delivery
- Human-safe by default
Architecture Summary
The system is intentionally divided into three layers.
1. Deterministic Monitoring (Python)
A scheduled Python job runs daily and:
- Performs TLS connections with SNI support
- Extracts certificate expiration dates
- Computes days remaining
- Classifies status (OK, Expiring, Expired, Monitor Failed)
- Applies standardized alert rules:
  - Threshold reminders (60 / 45 / 30 / 14 / 7 / 3 / 1 days)
  - Daily nags inside the final window
  - Urgent flags for production environments
- Deduplicates alerts so the same condition is never sent twice in one day
All logic is deterministic and testable.
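The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the production job: the function names, the result dictionary, the 60-day "Expiring" cutoff, the 7-day final window, and the choice to alert on monitor failures are all assumptions made for the example. Because the default TLS context verifies the certificate, an already-expired certificate fails the handshake, so the sketch maps that specific verification error to the Expired status.

```python
import socket
import ssl
from datetime import datetime, timezone

THRESHOLDS = {60, 45, 30, 14, 7, 3, 1}  # reminder days listed above
EXPIRING_WITHIN_DAYS = 60  # assumption: "Expiring" begins at the first threshold
FINAL_WINDOW_DAYS = 7      # assumption: daily nags begin one week out
CERT_HAS_EXPIRED = 10      # OpenSSL verify code X509_V_ERR_CERT_HAS_EXPIRED


def fetch_not_after(host, port=443, timeout=10.0):
    """Open a verified TLS connection with SNI and return the expiry in UTC."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sets the SNI extension and enables hostname checks.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_remaining(not_after, now):
    """Whole days until expiry; negative once the certificate has expired."""
    return (not_after - now).days


def classify(days):
    if days < 0:
        return "Expired"
    if days <= EXPIRING_WITHIN_DAYS:
        return "Expiring"
    return "OK"


def alert_reason(days):
    """Return why an alert is due today, or None when nothing is due."""
    if days < 0:
        return "expired"
    if days in THRESHOLDS:
        return "threshold"
    if days <= FINAL_WINDOW_DAYS:
        return "daily"
    return None


def check_host(host, production=False, now=None):
    """Check one endpoint and return a status record (hypothetical shape)."""
    now = now or datetime.now(timezone.utc)
    result = {"host": host, "days_remaining": None, "reason": None}
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError as exc:
        expired = exc.verify_code == CERT_HAS_EXPIRED
        result["status"] = "Expired" if expired else "Monitor Failed"
        result["reason"] = "expired" if expired else "monitor_failed"
    except OSError:
        # DNS failures, timeouts, refused connections, other TLS errors.
        result["status"] = "Monitor Failed"
        result["reason"] = "monitor_failed"
    else:
        days = days_remaining(not_after, now)
        result.update(status=classify(days), days_remaining=days,
                      reason=alert_reason(days))
    # Assumption: any alert for a production endpoint is flagged urgent.
    result["urgent"] = production and result["reason"] is not None
    return result
```

Keeping `classify` and `alert_reason` as pure functions of the day count means the alert rules can be unit-tested with fixed dates and no network access; only `fetch_not_after` touches the wire.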
2. Notification Queue (Decoupling Layer)
Instead of sending emails directly, the system writes structured files to a SharePoint-synced folder:
- cert_status.json: full visibility snapshot
- notifications_outbox.json: only alerts due today
- email_send_queue.json: send-ready email queue
- notification_state.json: deduplication memory
This design:
- Prevents accidental spam
- Allows inspection before delivery
- Enables safe dry runs and testing
- Decouples detection from communication
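One way the deduplication and queue-writing step could look is sketched below. The file names come from the list above, but everything else is an assumption for illustration: the `publish` and `filter_unsent` helpers, the `host|reason` deduplication key, the state file layout (a date mapped to the keys already queued), and the dry-run behavior of writing nothing at all. Files are written to a temporary name and then renamed so a sync client or downstream reader never sees a half-written JSON document.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path):
    """Read the dedup state, or start empty if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def filter_unsent(alerts, state, today):
    """Drop alerts whose host|reason key was already recorded for `today`."""
    seen = set(state.get(today, []))
    fresh = []
    for alert in alerts:
        key = f"{alert['host']}|{alert['reason']}"
        if key not in seen:
            fresh.append(alert)
            seen.add(key)
    # Only today's keys are kept, so the state file does not grow forever.
    return fresh, {today: sorted(seen)}


def write_json_atomic(path, payload):
    """Write to a temp file in the same folder, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_name, path)


def publish(out_dir, alerts, today, dry_run=False):
    """Queue alerts not yet sent today; a dry run only reports what would go."""
    state_path = out_dir / "notification_state.json"
    fresh, new_state = filter_unsent(alerts, load_state(state_path), today)
    if not dry_run:
        write_json_atomic(out_dir / "notifications_outbox.json", fresh)
        write_json_atomic(out_dir / "email_send_queue.json", fresh)
        write_json_atomic(state_path, new_state)
    return fresh
```

Calling `publish(folder, alerts, "2025-01-01", dry_run=True)` returns the alerts that would be queued without touching any file, which is one simple way to get the safe test runs described above.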
3. Delivery (Power Automate + Outlook)
A scheduled Power Automate flow:
- Reads the email send queue
- Parses structured email objects
- Applies lightweight guards
- Sends notifications via Outlook with appropriate importance
If the queue is empty, nothing is sent.
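The flow itself is built in Power Automate rather than in code, but the shape of a queue entry and the kind of guard it might apply can be illustrated in Python on the producer side. The sketch below is hypothetical throughout: the field names (`to`, `subject`, `body`, `importance`), the `owner_email` key, the subject wording, and the specific checks are assumptions. The importance values mirror the Low / Normal / High options that Outlook exposes.

```python
REQUIRED_FIELDS = ("to", "subject", "body", "importance")
ALLOWED_IMPORTANCE = {"Low", "Normal", "High"}


def build_email(alert):
    """Turn one alert record into a send-ready queue entry (assumed shape)."""
    days = alert["days_remaining"]
    return {
        "to": alert["owner_email"],
        "subject": f"[SSL] {alert['host']} certificate expires in {days} day(s)",
        "body": f"The certificate for {alert['host']} expires in {days} day(s).",
        # Urgent alerts are sent with high importance, everything else normal.
        "importance": "High" if alert.get("urgent") else "Normal",
    }


def is_sendable(email):
    """Lightweight guard: all fields present, plausible address, valid importance."""
    if any(not email.get(field) for field in REQUIRED_FIELDS):
        return False
    return "@" in email["to"] and email["importance"] in ALLOWED_IMPORTANCE
```

Running the same guard before an entry is written to the queue means a malformed entry is caught by the deterministic job, and the flow's own check serves as a second line of defense.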
Role of AI (Gemini)
AI is used only at design time, not runtime.
Gemini was leveraged to:
- Draft and refine operational email templates
- Improve clarity, tone, and escalation language
- Ensure communications were executive-safe and consistent
AI is not used to:
- Collect or evaluate certificate data
- Decide when to alert
- Send emails
- Modify files at runtime
This preserves determinism, auditability, and compliance while still benefiting from AI-assisted quality improvements.
Results & Impact
Quantified Impact
- Eliminated manual certificate checks, saving ~10–20 engineer-hours per quarter
- Reduced detection latency to at most 24 hours (one daily run), replacing ad-hoc or delayed awareness
- Improved alert quality through deduplication and standardized thresholds
- Reduced risk of certificate-related outages via proactive, multi-stage alerts
Operational Outcomes
- No reliance on human memory or spreadsheets
- Consistent, professional notifications every time
- Clear ownership and escalation paths
- Safe testing without sending emails
- Fully unattended daily operation
Why This Matters
This project demonstrates:
- Platform-level thinking over point solutions
- Strong judgment around where AI does and does not belong
- Focus on operational excellence and risk reduction
- Ability to design systems that scale across teams and vendors
- Alignment with enterprise security and compliance realities
The solution mirrors how many large organizations deploy AI safely today: AI improves communication and design quality, while deterministic systems own runtime decisions.
Key Takeaways
- Many reliability issues are process failures, not tooling gaps
- Separating detection, decisioning, and delivery dramatically improves safety
- Alert hygiene matters as much as alerting itself
- AI adds the most value when used intentionally, not indiscriminately
What’s Next
Potential future enhancements include:
- Trend reporting and dashboards
- Leadership-level risk summaries
- Expanded endpoint coverage
- Historical analysis of near-miss events
Final Note
This system was designed to be boring in the best way possible: predictable, inspectable, and reliable.
That’s exactly what production operations require.
