Cloud & Distributed Systems Architect — AWS EMEA Networking & Resilience Lead

Cristian Critelli

Author of Cloud Networking and Resilience (Apress, 2026). Building the frameworks, patterns, and tools that keep distributed systems reliable at scale. 20+ years across IBM, Cisco, Microsoft, and AWS.

Geneva, CH · Apress Author · DORA, NIS2 & Compliance · Cloud Resilience · BGP & Cloud Networking · AIOps, AI & Agentic Infrastructure · Cisco, AWS, Azure & 57+ Certifications

Cloud Networking & Resilience

Apress Media, LLC — Coming May 2026

Designing Scalable, Fault-Tolerant, and Highly-Available Cloud Network Architectures.

From DNS and routing to traffic engineering, hybrid connectivity, fault isolation, and disaster recovery — culminating in AIOps and AI-driven network operations. Deep technical exploration of how to build resilient networks at every layer. Cloud-agnostic principles with AWS reference architectures and real-world deployment insights.

Dedicated to Ade (2017–2022). All personal proceeds donated to charities supporting cats in distress.

ISBN (Print): 979-8-8688-2435-7
ISBN (eBook): 979-8-8688-2436-4
Audience: Network Engineers · Cloud Architects · SREs · DevOps · CROs · Resilience Officers

02 About

I've spent 20+ years building, breaking, and fixing networks — from Rome's telecom exchanges to London's enterprise data centers to Geneva's cloud architectures. The path went through IBM, Cisco, Riverbed, Microsoft, and now AWS, where I lead networking and resilience strategy across EMEA.

My book Cloud Networking and Resilience (Apress, May 2026) distills everything I've learned about keeping distributed systems reliable — from BGP routing to cell-based architectures to chaos engineering. The live monitor on this page is a working demonstration of those concepts.

Over 57 certifications across Cisco, AWS, Microsoft Azure, Riverbed, IBM, Wireshark, Aviatrix, and more — spanning networking, security, cloud architecture, and resilience. Recognised with awards from Microsoft (WW SME Networking, CSS Impact), AWS (AWSome All Star, Hidden Hero, Growth Mindset, AWSome Builder), Aviatrix (Cloud Networking Hero), and DeMolay International (Chevalier, Legion of Honour).


I don't just write about resilience. I measure it — in real time, from 20 collectors, across every continent.

Live Data

Global Internet Resilience Monitor

Real-time BGP routing intelligence from RIPE RIS collectors worldwide. Every concept in the book — resilience scoring, route analytics, convergence timing, deep signal analysis — running live. Single-file. Zero dependencies. Built from scratch.

Resilience Score: 96 / 100 (Healthy)
RTBH Blackhole: Clear
No blackhole events detected (0)
RTBH :666 community monitoring
N. America: 98 (12 collectors)
Europe: 97 (14 collectors)
Asia-Pacific: 95 (4 collectors)
S. America: 94 (2 collectors)
Middle East: 96 (1 collector)
Africa: 93 (1 collector)
Resilience Score — Last 60 Seconds
Live BGP Event Stream
Announcements: 0 · Withdrawals: 0 · Unique ASNs: 0 · Prefixes: 0 · Updates/sec: 0 · RIS Peers: 0
AS Path Length Distribution
Normal
Avg path: — hops · Longest: — hops
Protocol Distribution
Dual-Stack
0
total prefixes observed
IPv4
IPv6
IPv4: 0 · IPv6: 0
Route Flap Detector
Stable
0
flapping prefixes (≥3 state changes / 30s)
Monitoring prefix stability…
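
The flap rule above (at least 3 state changes within 30 seconds) can be sketched as a sliding-window counter. This is a minimal illustration, not the monitor's actual code: the `FlapDetector` class, its method names, and the `"A"`/`"W"` state labels are hypothetical, and a "state change" is assumed to mean a transition between announced and withdrawn.

```python
from collections import defaultdict, deque

WINDOW_S = 30.0   # rolling window from the widget text
THRESHOLD = 3     # state changes needed to call a prefix "flapping"


class FlapDetector:
    """Hypothetical sliding-window flap detector for BGP prefixes."""

    def __init__(self):
        self._state = {}                    # prefix -> last seen state ("A" or "W")
        self._changes = defaultdict(deque)  # prefix -> timestamps of state changes

    def observe(self, ts, prefix, state):
        """Record an announcement ("A") or withdrawal ("W") seen at time ts (seconds)."""
        previous = self._state.get(prefix)
        if previous is not None and previous != state:
            self._changes[prefix].append(ts)
        self._state[prefix] = state

    def flapping(self, now):
        """Return the set of prefixes with >= THRESHOLD changes in the last WINDOW_S seconds."""
        result = set()
        for prefix, times in self._changes.items():
            while times and now - times[0] > WINDOW_S:
                times.popleft()  # drop changes that fell out of the window
            if len(times) >= THRESHOLD:
                result.add(prefix)
        return result
```

Repeated announcements of an already-announced prefix do not count as changes in this sketch; only announce/withdraw transitions do.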
MOAS Detection
Clear
0
multiple-origin AS conflicts detected
Prefixes announced by ≥2 distinct origin ASNs
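
As a rough sketch of the MOAS rule above, the origin AS can be taken as the last ASN in each AS path and collected per prefix. The function name and input shape are assumptions for illustration; AS paths are assumed to be flat lists of integers (no AS_SET segments).

```python
from collections import defaultdict


def moas_conflicts(announcements):
    """Return {prefix: set of origin ASNs} for prefixes with >= 2 distinct origins.

    announcements: iterable of (prefix, as_path) pairs, where as_path is a
    list of ASNs ordered from the observing peer to the origin.
    """
    origins = defaultdict(set)
    for prefix, as_path in announcements:
        if as_path:
            origins[prefix].add(as_path[-1])  # origin AS is the last hop
    return {p: asns for p, asns in origins.items() if len(asns) >= 2}
```

A MOAS result is a signal to investigate rather than proof of a hijack, since anycast and multi-homed setups can legitimately announce one prefix from several origins.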
Convergence Timing
Normal
Mean (ms) · p50 · p95 · Events: 0
Withdrawal → re-announcement convergence time · last 50 events · 0–2s scale
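
One way to compute these figures is to remember when each prefix was withdrawn and take the delta when it is announced again, keeping only the last 50 samples. The class below is a hypothetical sketch: the names are illustrative, and the nearest-rank percentile is an assumption about how p50 and p95 might be derived.

```python
import math
from collections import deque


class ConvergenceTimer:
    """Tracks withdrawal -> re-announcement time per prefix (hypothetical sketch)."""

    def __init__(self, max_events=50):
        self._withdrawn_at = {}                     # prefix -> withdrawal timestamp (s)
        self.samples_ms = deque(maxlen=max_events)  # rolling window of convergence times

    def on_withdrawal(self, ts, prefix):
        self._withdrawn_at[prefix] = ts

    def on_announcement(self, ts, prefix):
        start = self._withdrawn_at.pop(prefix, None)
        if start is not None:  # only count announcements that follow a withdrawal
            self.samples_ms.append((ts - start) * 1000.0)

    def mean(self):
        if not self.samples_ms:
            return None
        return sum(self.samples_ms) / len(self.samples_ms)

    def percentile(self, p):
        """Nearest-rank percentile of the current samples, or None if empty."""
        if not self.samples_ms:
            return None
        ordered = sorted(self.samples_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```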
Top Unstable Prefixes
Stable
Ranked by state changes in rolling 60s window
Deep Signal Analysis
BGP Community Signals
Normal
0
community-tagged updates
⬤ RTBH: 0 ⬤ No-Export: 0 ⬤ TE: 0
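
The three counters above could be produced by classifying each community on an update. In this sketch, `NO_EXPORT` (65535:65281) and `BLACKHOLE` (65535:666) are the well-known values from RFC 1997 and RFC 7999. Treating any `<asn>:666` as RTBH and everything else as traffic engineering (TE) are simplifying assumptions, not the monitor's documented rules.

```python
from collections import Counter

NO_EXPORT = (65535, 65281)  # well-known NO_EXPORT community (RFC 1997)
BLACKHOLE = (65535, 666)    # well-known BLACKHOLE community (RFC 7999)


def classify_community(community):
    """Map one (asn, value) community pair to 'no_export', 'rtbh', or 'te'."""
    _asn, value = community
    if community == NO_EXPORT:
        return "no_export"
    # Assumption: operator-specific <asn>:666 is also counted as an RTBH signal.
    if community == BLACKHOLE or value == 666:
        return "rtbh"
    # Assumption: any other community is counted as traffic engineering.
    return "te"


def count_signals(updates):
    """updates: iterable of community lists, one list per BGP update."""
    counts = Counter()
    for communities in updates:
        for community in communities:
            counts[classify_community(community)] += 1
    return counts
```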
AS Path Loop Detection
Clear
0
path loops detected (duplicate ASN in path)
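
A minimal sketch of the duplicate-ASN check, under one stated assumption: consecutive repeats of the same ASN are treated as AS path prepending (a common traffic engineering practice) and are not counted as loops. Only an ASN that reappears after a different ASN is flagged. The function name is hypothetical.

```python
from itertools import groupby


def has_path_loop(as_path):
    """Return True if an ASN reappears non-adjacently in the AS path."""
    # Collapse runs of identical ASNs so prepending is not mistaken for a loop.
    collapsed = [asn for asn, _run in groupby(as_path)]
    return len(collapsed) != len(set(collapsed))
```

If the stricter reading is wanted (any duplicate at all), the `groupby` step can be dropped and the raw path compared against its set.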
Prefix Deaggregation
Normal
≤/20: 0 · /21–/23: 0 · /24+: 0 · /24+ ratio: 0%
Ratio of specific (/24+) vs aggregate prefixes — spikes indicate DDoS mitigation or misconfig
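
The three buckets and the /24+ ratio can be computed with the standard `ipaddress` module. This sketch assumes the bucket boundaries apply to IPv4 only (IPv6 prefix lengths sit on a different scale), so IPv6 prefixes are skipped. The function name and bucket keys are illustrative.

```python
import ipaddress


def deaggregation_buckets(prefixes):
    """Return (bucket counts, ratio of /24-or-longer prefixes) for IPv4 prefixes."""
    buckets = {"<=/20": 0, "/21-/23": 0, "/24+": 0}
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix, strict=False)
        if net.version != 4:
            continue  # assumption: thresholds are only meaningful for IPv4
        if net.prefixlen <= 20:
            buckets["<=/20"] += 1
        elif net.prefixlen <= 23:
            buckets["/21-/23"] += 1
        else:
            buckets["/24+"] += 1
    total = sum(buckets.values())
    ratio = buckets["/24+"] / total if total else 0.0
    return buckets, ratio
```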
First-Seen Prefix Alerts
Normal
0
new prefixes in last 60 seconds
Peer Diversity Index
Diverse
avg unique peers per prefix
Top-5 most observed prefixes by unique RIS peer count. Higher = more globally visible.
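
A small sketch of how the index and the top-5 list might be derived from (prefix, peer) observations. The function name and input shape are assumptions. Ties in the ranking are broken alphabetically by prefix to keep the output deterministic.

```python
from collections import defaultdict


def peer_diversity(observations, top_n=5):
    """Return (average unique peers per prefix, top-N prefixes by unique peer count).

    observations: iterable of (prefix, peer_id) pairs, one per received update.
    """
    peers = defaultdict(set)
    for prefix, peer in observations:
        peers[prefix].add(peer)
    if not peers:
        return 0.0, []
    average = sum(len(seen) for seen in peers.values()) / len(peers)
    ranked = sorted(
        ((prefix, len(seen)) for prefix, seen in peers.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return average, ranked[:top_n]
```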
TCP/TLS RTT · Global Endpoints
Probing…
Cloudflare
Google
AWS EU
Azure
Akamai
Quad9
Connection Tier
Probing…
Avg RTT
Best RTT
Jitter
Your Session Intelligence
Your Network
Detecting…
ASN
ISP
City
Country
Your ASN spotted in the live feed:
0
announcements involving your network observed this session
Global Routing Table Coverage
Observing
0
of ~985,000 global IPv4+IPv6 routes
0.00%
Unique prefixes observed during this session

Still here? Good. Most people don't look past the surface. You clearly think differently.

See this dashboard the way a NOC engineer would. Full screen. No distractions. Just the internet, live.

CRITELLI NOC · LIVE BGP INTELLIGENCE
03 Resilience in Action

Interactive cell-based architecture simulation. Click any cell or Direct Connect path to inject a failure — watch Route 53 health checks detect, ARC routing controls flip via the control plane, and traffic reroute.

Bulkhead isolation · Shuffle sharding · Route 53 ARC · Cell Router (ELB) · DX Maximum Resilience · Active/Active ECMP

System Healthy
Availability SLO: 99.99% (≤ 52.6 min downtime/yr)
Active Cells: 4 / 4 (all regions healthy)
Blast Radius: 0% (no impact)
Estimated RTO: 0s (no failover in progress)
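
The downtime figure attached to the availability SLO follows from simple arithmetic: the allowed unavailability multiplied by the minutes in a year. The sketch below assumes a 365-day year. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_budget_minutes(slo_percent):
    """Yearly downtime allowed by an availability SLO given in percent."""
    return (100.0 - slo_percent) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for slo in (99.9, 99.99, 99.999):
        print(f"{slo}% -> {downtime_budget_minutes(slo):.2f} min/yr")
```

For 99.99% this works out to roughly 52.56 minutes per year, and each additional nine divides the budget by ten.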
Incident Timeline
Click a cell or DX path to begin
Failover Progress
Steady State
Failure → Detect (30s) → ARC Flip → TTL 60s → Complete
Multi-Region Cell Architecture Live
Healthy · Failed · Detecting · HTTP HC · ARC Probe · DX Monitor · VPN · DX Path
ⓘ Architecture Reference

◎ Route 53

Anycast DNS entry point. Health checks poll cells every 10s with 3-failure threshold (~30s detection). Returns healthy cell IPs to clients.
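
The ~30s detection figure is the poll interval multiplied by the failure threshold. Combined with the 60s TTL step shown in the failover timeline, that gives a rough worst-case recovery estimate. The sketch below is back-of-the-envelope arithmetic, not a model of Route 53 internals. `arc_flip_s` is a hypothetical parameter because the source gives no duration for the ARC flip.

```python
def detection_time_s(interval_s=10, failure_threshold=3):
    """Time for consecutive failed health checks to mark a cell unhealthy."""
    return interval_s * failure_threshold


def worst_case_failover_s(interval_s=10, failure_threshold=3, dns_ttl_s=60, arc_flip_s=0):
    """Detection + routing-control flip + time for cached DNS answers to expire.

    arc_flip_s is an assumed placeholder; the page does not state a flip duration.
    """
    return detection_time_s(interval_s, failure_threshold) + arc_flip_s + dns_ttl_s


if __name__ == "__main__":
    print(detection_time_s())        # 30
    print(worst_case_failover_s())   # 90
```

Clients that cached the DNS answer just before the failure wait the longest, which is why the TTL is added in full.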

◆ ARC Routing Controls

Control plane only, not in the data path. Sets Route 53 routing control states (ON/OFF per cell) to steer traffic away from impaired cells. Readiness checks audit cell resources every 60s (capacity, config, quotas). The ARC cluster spans multiple Regions, so routing controls stay available even when a single Region is impaired.
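
How the routing control state combines with health checks can be pictured as a filter over candidate cells: a cell is returned to clients only if its health check passes and its routing control is ON. This is a toy model for illustration. The `Cell` type and `dns_answers` function are hypothetical and do not call any AWS API.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    ip: str
    health_check_ok: bool     # result of the HTTP health check
    routing_control_on: bool  # operator-set ON/OFF state


def dns_answers(cells):
    """Return IPs of cells that are both healthy and switched ON."""
    return [c.ip for c in cells if c.health_check_ok and c.routing_control_on]
```

Setting `routing_control_on=False` removes a cell from the answers even while its health check still passes, which is the "steer away from an impaired cell" case.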

■ Cell Router (ELB)

Elastic Load Balancer per cell. Routes client requests to cell compute. Layer 7 routing with path-based rules, cross-AZ distribution.

◼ Cell (Bulkhead)

Independent fault domain. Each cell has its own compute, data store, and ELB. Failures are contained within cell boundaries (blast radius reduction).

△ Aurora Global DB

Aurora Global Database has one primary writer cluster (Cell A) with read replicas in other cells and Regions (<1s replication lag). If the writer fails, Aurora promotes a read replica to writer. Cell B holds a local read replica in the same Region for low-latency reads. You could substitute DynamoDB Global Tables or another multi-region data store depending on your workload.

◆ DX Gateway + TGW Peering

DX Gateway is a global construct that associates with regional Transit Gateways and Direct Connect connections, enabling on-premises traffic to reach any associated Region without separate DX connections per Region. For Region-to-Region traffic (e.g. Cell A replicating to Cell C), inter-Region TGW peering provides a direct path between Regions without hairpinning through on-premises — completing the multi-Region mesh.

◆ DX Maximum Resilience

4 connections across 2 geographically separated metros (active/active ECMP). Both paths carry traffic simultaneously. VPN backup activates only when both DX paths fail. For simplicity, the simulation shows one link per metro location rather than the full two-connection-per-site topology.

◆ VPN Backup

IPsec tunnel in standby. Auto-activates when all DX connections fail. Higher latency but ensures on-premises connectivity is never fully lost.
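
The path preference described for Direct Connect and VPN can be sketched as a small selection function: every healthy DX path is used (active/active ECMP), and the VPN is chosen only when no DX path is up. The function name, the path labels, and the `"vpn"` marker are hypothetical.

```python
def select_paths(dx_up, vpn_up):
    """Pick the forwarding paths for on-premises traffic.

    dx_up: dict mapping a DX path name to True (healthy) or False (failed).
    vpn_up: whether the standby VPN tunnel is available.
    """
    active_dx = sorted(name for name, up in dx_up.items() if up)
    if active_dx:
        return active_dx  # ECMP across all healthy DX paths
    return ["vpn"] if vpn_up else []  # VPN only when every DX path is down
```

Losing one metro leaves traffic on the remaining DX path, and the VPN stays idle until both are gone.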

◆ Shuffle Sharding

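Probabilistic tenant isolation. Instead of a linear blast radius (failed cells / total cells, fc/tc), the chance that a tenant's whole shard falls inside the failed cells is roughly P² for two-cell shards, where P = fc/tc. This dramatically reduces the chance that any two tenants share the same failed cells.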
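
The gap between the linear and the shuffle-sharded blast radius can be checked with a little combinatorics. The sketch assumes each tenant is assigned a uniformly random shard of `shard_size` distinct cells. The function names are illustrative. With two-cell shards the exact probability is close to, and slightly below, (fc/tc)².

```python
from math import comb


def linear_blast_radius(failed, total):
    """Fraction of tenants affected when tenants map to single cells (fc/tc)."""
    return failed / total


def shard_fully_failed_probability(failed, total, shard_size):
    """Probability that every cell in a random shard is among the failed cells."""
    return comb(failed, shard_size) / comb(total, shard_size)


def same_shard_probability(total, shard_size):
    """Probability that two tenants are assigned exactly the same shard."""
    return 1 / comb(total, shard_size)


if __name__ == "__main__":
    fc, tc = 2, 8
    print(linear_blast_radius(fc, tc))                # 0.25
    print(shard_fully_failed_probability(fc, tc, 2))  # 1/28, about 0.036
    print(same_shard_probability(tc, 2))              # 1/28, about 0.036
```

Larger shard sizes or more cells shrink both probabilities quickly, at the cost of each tenant touching more cells.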

⚙ Simplified View

For clarity, this simulation omits some networking components: regional Transit Gateways, TGW peering attachments, VPC route tables, security groups, NACLs, and NAT Gateways. In production, each cell would have its own TGW attachment and full VPC networking stack.

04 What I Do

Cloud Networking

Enterprise-grade connectivity across AWS, Azure, and GCP. Direct Connect, ExpressRoute, Cloud WAN, Transit Gateway, VWAN, PrivateLink — at global scale.

DX · TGW · Cloud WAN · ExpressRoute · VWAN

Resilience Engineering

Cell architectures, bulkhead patterns, shuffle sharding, blast radius reduction. Author of the Resilience Lifecycle Framework and the D-CAT compliance engine.

Cell Architecture · ARC · FIS · HA/DR · DRS

AIOps & AI Infrastructure

AIOps-driven observability, anomaly detection, and predictive remediation. GPU/TPU compute design, secure inference pipelines, and AI-ready network fabrics for distributed systems at enterprise scale.

AIOps · GPU/TPU · Inference · ML Ops · AI-Ready

DORA, NIS2 & Compliance

EU regulatory compliance: DORA (Digital Operational Resilience Act) and NIS2 (Network and Information Security Directive). Built D-CAT — scanning 30+ AWS services across multiple regions. Enabling 22,000+ regulated entities across EMEA.

DORA · NIS2 · D-CAT · ICT Risk · FSI · RegTech

Traffic Engineering & BGP

BGP, MPLS, segment routing, DNSSEC, Anycast DNS, multi-provider resilience. Deep protocol-level expertise from ISP and Telco roots. The monitor above runs on this.

BGP · MPLS · DNSSEC · Anycast · SR

Security Architecture

Zero trust at cloud scale. Stateful/stateless firewalls, DDoS mitigation, DLP, micro-segmentation. Fortune 100 deployments across 15,000+ workloads.

Zero Trust · Segmentation · DLP · DDoS

05 Speaking & Writing
🎤

Conference Talks

Available for keynotes, breakout sessions, and technical deep-dives on cloud resilience, AIOps, DORA compliance, BGP routing, and distributed systems architecture.

Keynotes · Breakouts · Deep-Dives
🎙️

Podcasts & Panels

Happy to join conversations about internet resilience, cloud networking at scale, the future of infrastructure, and what DORA and NIS2 mean for regulated industries.

Podcasts · Panels · Interviews
✍️

Technical Writing

Published author (Apress). Contributor to technical guides and best-practice documentation. Writing about what I build, break, and fix.

Apress · Blog · Guides

Topics I speak and write about:

Cloud Resilience Patterns · Cell-Based Architecture · BGP & Internet Routing · DORA, NIS2 & Compliance · Disaster Recovery at Scale · DNS Resilience · Chaos Engineering · Multi-Cloud Networking · Observability for Networks · AIOps & Intelligent Operations

Let's build something resilient

Available for conference talks, podcasts, and technical deep-dives on cloud resilience and networking architecture. For press and book inquiries, reach out directly.

LinkedIn Follow GitHub Blog Email