Acme Corp Messaging

You — Head of Operations

# Incident Response

— Production issues

Tuesday 21 April

Tom Brennan

SEV2 — Pennington Mutual settlement batch delayed. @here investigating. ~500k in payments stuck. started seeing errors around 09:45. pulling logs now

👀4

David Kimura

on it. what's the error signature?

Tom Brennan

settlement_processor throwing `DeadlockDetectedException` on the batch_jobs table. frequency is increasing

Lena Park

pulling query stats from pg_stat_activity now. give me 5 min

Ryan Kelly

checking connection pool metrics. infra dashboard is showing pool utilisation at 94% on settlement-db-01

Priya Sharma

heads up — Pennington relationship manager has already emailed. keeping them on hold until we have something concrete

Lena Park

found it. there's a long-running transaction that's been open since 09:43 — looks like the batch reconciliation job started before the previous one finished. two jobs holding locks on the same rows in `settlement_entries`

🔍2

Tom Brennan

so the batch scheduler fired the job twice?

Lena Park

looks that way. the idempotency check is supposed to prevent that but it's not working if the previous job is still in-flight. the lock contention is cascading — new transactions are piling up waiting for the deadlock to resolve
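A minimal sketch of the kind of guard being described, assuming a hypothetical in-memory registry. The real scheduler's idempotency check is not shown in this thread, and `JobRegistry`, `try_start`, and `finish` are illustrative names only. The idea is that a job key stays claimed for as long as a run is in flight, so a second trigger is refused instead of starting and taking locks on the same rows.

```python
import threading


class JobRegistry:
    """Hypothetical in-flight job registry (illustrative, not the real scheduler)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_start(self, job_key: str) -> bool:
        """Claim job_key; return False if a run with that key is still in flight."""
        with self._lock:
            if job_key in self._in_flight:
                return False
            self._in_flight.add(job_key)
            return True

    def finish(self, job_key: str) -> None:
        """Release the claim once the run has committed or rolled back."""
        with self._lock:
            self._in_flight.discard(job_key)


registry = JobRegistry()
first = registry.try_start("batch-reconciliation")   # first trigger claims the key
second = registry.try_start("batch-reconciliation")  # duplicate trigger is refused
registry.finish("batch-reconciliation")
third = registry.try_start("batch-reconciliation")   # allowed again after finish
```

With more than one scheduler process the claim would have to live in shared storage rather than in memory; this sketch only covers the single-process case.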

Ryan Kelly

connection pool just hit 100%. new connections are being rejected. this is going to start affecting other clients soon not just Pennington

😬3

David Kimura

ok priority is to kill the stuck transaction and drain the queue. lena, can you identify the pid?

Lena Park

pid 47832. it's been idle-in-transaction for 41 minutes. safe to terminate
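For reference, one way to find a session like this, sketched under assumptions. The query uses real `pg_stat_activity` columns (`pid`, `state`, `xact_start`), but the helper `stale_sessions`, the 30-minute cutoff, and the second sample row are hypothetical and not taken from the incident.

```python
from datetime import datetime, timedelta, timezone

# Lists sessions that hold a transaction open while doing nothing.
IDLE_IN_TXN_SQL = """
SELECT pid, state, xact_start
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start
"""


def stale_sessions(rows, now, max_age=timedelta(minutes=30)):
    """Return pids whose transaction has been open longer than max_age.

    rows: dicts with 'pid' and 'xact_start', e.g. fetched with IDLE_IN_TXN_SQL.
    """
    return [r["pid"] for r in rows if now - r["xact_start"] > max_age]


# Sample rows: the first mirrors the thread (open since 09:43),
# the second is an invented short-lived session for contrast.
utc = timezone.utc
rows = [
    {"pid": 47832, "xact_start": datetime(2026, 4, 21, 9, 43, tzinfo=utc)},
    {"pid": 50001, "xact_start": datetime(2026, 4, 21, 10, 20, tzinfo=utc)},
]
stale = stale_sessions(rows, now=datetime(2026, 4, 21, 10, 24, tzinfo=utc))
```

Terminating the backend would then be a separate, deliberate step (PostgreSQL provides `pg_terminate_backend(pid)` for that).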

David Kimura

go ahead and kill it. ryan — get ready to restart the connection pool manager once lena clears the lock

Lena Park

terminated. monitoring pg_locks now to confirm cascade clears

Ryan Kelly

pool utilisation dropping. 87%... 72%... ok it's draining

🤞2

Tom Brennan

settlement processor is picking up again. seeing successful batch completions in the logs

✅3

Priya Sharma

good. how many transactions are still in the backlog?

Lena Park

roughly 340 still queued. at normal throughput we're looking at ~90 minutes to clear. no further errors in the last 5 minutes

Tom Brennan

update for priya to pass to Pennington: root cause identified and resolved. settlement processing resumed. full backlog expected to clear by 12:15 UTC. no data loss

👍2

Ryan Kelly

restarting the batch scheduler with the duplicate-job guard enabled. should have been on by default tbh — that config flag is a trap

Lena Park

backlog is 60% cleared. all transactions processing cleanly

Tom Brennan

backlog fully cleared. 09:43 to ~11:40, total duration about 2 hours. all Pennington settlements confirmed. marking SEV2 resolved

✅4

David Kimura

nice work everyone. tom, can you write up the postmortem doc and pin the resolution here?

Tom Brennan📌

RESOLVED: SEV2 Pennington Settlement Delay (2026-04-21)
Duration: 09:43–11:47 UTC (~2hr 4min)
Impact: ~500k in settlement payments delayed for Pennington Mutual. No data loss.
Root cause: Batch scheduler fired duplicate reconciliation job due to misconfigured idempotency guard. Second job acquired overlapping row locks causing deadlock cascade. Connection pool exhaustion secondary effect.
Resolution: Terminated hung transaction (pid 47832), drained connection pool, re-enabled duplicate-job guard in batch scheduler config.
Actions: Batch scheduler config reviewed and corrected. PR raised to enforce idempotency guard as non-optional.
Postmortem doc in Notion: [INC-2026-031]

👍6

Priya Sharma

Pennington confirmed all payments settled. they were understanding. good comms throughout

🙏1

Thursday 23 April

Ryan Kelly

reminder to everyone: if connection pool utilisation goes above 80% you'll see it in Grafana dashboard infra-db-pool. I've set an alert at 85% now. should've done this sooner

👍3

Saturday 25 April

Lena Park

weekly check — all systems nominal. settlement processing, KYC, and payment rails all green

✅1

Thursday 30 April

Tom Brennan

all quiet this week. nothing to report

Saturday 2 May

Lena Park

heads up — I'm seeing elevated error rates from Onfido on the KYC verification endpoint. rate limiting errors for Broadgate Financial onboarding flow. started about 20 minutes ago

👀1

Tom Brennan

how bad? are new KYC checks failing outright or just slowing down?

Lena Park

roughly 40% of requests returning 429. the rest are succeeding but with retries. Broadgate onboarding is partially degraded — applicants are seeing longer wait times and some are getting error screens

Tom Brennan

ok logging this as SEV3 KYC degradation. david-kimura FYI — Broadgate Financial onboarding affected, not a full outage but worth flagging to their account team

David Kimura

on it. notifying Broadgate account manager now. what's the fix path?

Lena Park

checking our Onfido rate limit tier. we're on 100 req/min and Broadgate's onboarding batch this morning is pushing 130+. they must have had a marketing campaign or something — volume spike. we need to either throttle our requests or upgrade the Onfido tier

Tom Brennan

can we add request queuing on our side to smooth it out within the 100/min limit?

Lena Park

yes — we have a basic rate limiter in the KYC service but it's not enabled by default. let me push a config change

Lena Park

rate limiter config deployed. KYC requests now smoothed to 90 req/min to give headroom. error rate dropping

✅2
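A rough sketch of request smoothing at 90 req/min, assuming a simple pacing approach. The KYC service's actual rate limiter is not shown in this thread, so `PacedLimiter` and `FakeClock` are hypothetical names. Each call waits for the next evenly spaced slot; the fake clock lets the example run instantly instead of really sleeping.

```python
import time


class PacedLimiter:
    """Spaces calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def acquire(self) -> float:
        """Block until the next free slot; return how long the caller waited."""
        now = self._clock()
        wait = max(0.0, self._next_slot - now)
        if wait:
            self._sleep(wait)
        # Reserve the slot after this one for the next caller.
        self._next_slot = max(now, self._next_slot) + self._interval
        return wait


class FakeClock:
    """Simulated time source so the demo does not actually sleep."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
limiter = PacedLimiter(per_minute=90, clock=clock.now, sleep=clock.sleep)
for _ in range(130):  # a burst the size of the spike mentioned above
    limiter.acquire()
elapsed = clock.t  # simulated seconds needed to release the whole burst
```

Under these assumptions a burst of 130 requests is spread over about 86 simulated seconds, which keeps the outbound rate under the 100 req/min tier at the cost of added latency for queued requests.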

Tom Brennan

Onfido 429s have stopped. all KYC checks going through cleanly. marking resolved. duration ~30 min. low impact — no failed verifications, just delays

👍2

David Kimura

Broadgate account manager says their client noticed some slowness but no escalation. good outcome

Lena Park

I'll raise a ticket to get the KYC rate limiter enabled by default and to review our Onfido tier — current limit is too tight for growth

👍1

Saturday 16 May

Ryan Kelly

just did a manual check on alerting coverage and we have zero PagerDuty alerts fired in the last 2 weeks. zero. checked the integration and the API token expired on may 3rd. we've been flying blind

😱4

Tom Brennan

wait what. the token expired two weeks ago and nobody noticed? I thought you were handling the PD token rotation ryan

Ryan Kelly

I thought YOU were. you set up the integration originally. I've never had visibility on the token expiry

Tom Brennan

I set it up 8 months ago, I haven't touched it since. there was no handover

Ryan Kelly

well someone should have documented who owns it. this is exactly the kind of thing that falls through the gaps

Priya Sharma

ok let's pause on the blame thread. the immediate problem is we have no alerting right now. ryan, can you regenerate the token and restore the integration today?

👍3

Ryan Kelly

yes. doing it now

Priya Sharma

once that's done I want to spend 20 minutes this afternoon mapping out all our critical integration tokens and who owns them. we'll add expiry reminders to the ops calendar and assign a named owner to each. tom and ryan — can you both join a quick call at 14:30?

👍2

Ryan Kelly

token regenerated. PagerDuty integration is live. running a test alert now

Ryan Kelly

test alert fired and received. we're back online

✅4

David Kimura

worth checking: were there any actual incidents in the last two weeks that we should have been paged for but weren't?

Lena Park

reviewed logs for the past 14 days. nothing that would have triggered a SEV1 or SEV2. we got lucky. the Onfido issue on May 2 was caught manually before it would have auto-paged anyway

😅2

Priya Sharma📌

Action items from 14:30 call:
All integration tokens documented in Notion ops runbook with owner and expiry date.
Calendar reminders set 30 days before each expiry.
PagerDuty token: ryan-kelly is named owner. Onfido token: lena-park. Stripe webhook secret: tom-brennan.
Monthly token audit added to ops checklist.
Thanks both for getting this sorted quickly

👍5
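The monthly audit could be as small as the sketch below. Token names and owners come from the action items above; the expiry dates are invented placeholders, and `due_for_reminder` is a hypothetical helper, not something that exists in the runbook.

```python
from datetime import date, timedelta

REMINDER_LEAD = timedelta(days=30)  # reminders go out 30 days before expiry

# Expiry dates below are made up for illustration only.
TOKENS = [
    {"name": "PagerDuty token", "owner": "ryan-kelly", "expires": date(2027, 5, 16)},
    {"name": "Onfido token", "owner": "lena-park", "expires": date(2026, 7, 1)},
    {"name": "Stripe webhook secret", "owner": "tom-brennan", "expires": date(2026, 12, 1)},
]


def due_for_reminder(tokens, today):
    """Return tokens that have expired or will expire within REMINDER_LEAD."""
    return [t for t in tokens if t["expires"] - today <= REMINDER_LEAD]


due = due_for_reminder(TOKENS, today=date(2026, 6, 10))
for token in due:
    print(f"{token['name']}: remind {token['owner']} (expires {token['expires']})")
```

Already-expired tokens are included on purpose, since a negative remaining time is also within the reminder window.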

Tom Brennan

for the record: not a production incident but this was a real gap. we were lucky. good catch ryan even if the conversation was a bit spicy

😅3

Ryan Kelly

yeah fair. good outcome

Thursday 21 May

Lena Park

weekly check — all good. no incidents, alerting confirmed healthy

✅1

Tuesday 26 May

Tom Brennan

flagging for visibility: seeing intermittent >500ms response times on Vertex Capital API calls. not consistent but happening enough that I'm watching it. p99 latency over the last hour is 620ms, up from baseline ~180ms

👀3

Ryan Kelly

what endpoints specifically?

Tom Brennan

/v2/payments/initiate and /v2/accounts/balance — both read and write paths affected. started around 10:40

Lena Park

pulling traces from Jaeger. will look at the full request path

Ryan Kelly

my first guess is connection pool exhaustion under load. Vertex integration uses the shared HTTP client pool and there's been a batch job running since 10:30. checking

Lena Park

traces show the latency is in the outbound HTTP call to Vertex, not in our processing. so it's either their side or something in the network path

Ryan Kelly

connection pool looks fine actually. utilisation at 45%, no queue buildup. so not that

Tom Brennan

checked Vertex status page — no incidents reported. going to reach out to their integration team

David Kimura

is this affecting live payments or just latency?

Tom Brennan

all requests completing successfully, just slow. no failures. Vertex Capital payments going through but taking longer. keeping an eye on it

👍1

Lena Park

I've put a 30-minute rolling average on the Vertex API latency in Grafana. if it spikes above 1s I'd want to know immediately

👍3

Tom Brennan

latency has come back down. p99 now 210ms, basically normal. vertex integration team not yet responded. going to keep monitoring through EOD

Ryan Kelly

keeping an eye on it overnight. if it spikes again we'll have more data to work with

Wednesday 27 May

Lena Park

update: saw another spike at 08:15 this morning, lasted about 12 minutes. p99 hit 780ms. then dropped back. pattern is intermittent — not load-correlated from our side

👀2

Tom Brennan

vertex integration team finally responded. they say they're not seeing anything on their end and their internal metrics look clean. they want us to send sample request IDs

Lena Park

sent them 15 request IDs from the two spike windows. waiting on response

Ryan Kelly

I've been checking if there's a pattern — time of day, specific IP routing, anything. the 10:40 spike yesterday and 08:15 today don't share a load profile. could be something in their CDN or load balancer doing rolling restarts

David Kimura

is there a business impact we need to communicate to Vertex relationship team or are we ok to keep this at the technical level for now?

Tom Brennan

no failures so far, just latency. I'd say keep it technical for now but if we see failures or it persists past this week we escalate

👍1

Lena Park

vertex responded to the request IDs. they can reproduce slightly elevated latency in those windows on their trace system but have no explanation yet. escalating internally on their side

Thursday 28 May

Ryan Kelly

third spike this morning. 09:05–09:18 UTC. p99 peaked at 910ms. still no failures. this is getting annoying

😤1

Tom Brennan

yeah. three spikes, no root cause, vertex can't explain it. I'm going to document what we know and log this as an open investigation. not enough to call it a formal incident but it needs tracking

Lena Park

agreed. I'll add a spike detector alert — if p99 goes above 600ms for more than 5 minutes, it pages. at least we'll have good data capture going forward

👍3
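The alert rule described here (p99 above 600ms for more than 5 minutes) can be sketched as a sustained-breach check. This is an illustration only: the real alert lives in Grafana, and `sustained_breach` plus the sample data are hypothetical.

```python
def sustained_breach(samples, threshold_ms=600.0, min_duration_s=300.0):
    """Return True if p99 stays above threshold_ms for longer than min_duration_s.

    samples: iterable of (timestamp_seconds, p99_ms), ordered by time.
    """
    breach_start = None
    for ts, p99 in samples:
        if p99 > threshold_ms:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach window
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any recovery resets the window
    return False


# One sample per minute: a 13-minute spike versus a 3-minute blip.
long_spike = [(i * 60, 910.0) for i in range(14)]
short_blip = [(0, 180.0), (60, 700.0), (120, 700.0), (180, 700.0), (240, 180.0)]

pages_on_long_spike = sustained_breach(long_spike)
pages_on_short_blip = sustained_breach(short_blip)
```

Requiring a continuous breach keeps single slow samples from paging, while spikes of the length seen this week would still trigger.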

Ryan Kelly

also worth considering: do we have a circuit breaker on the Vertex client? if this degrades into actual failures we want to fail fast rather than queue up

Lena Park

we don't. I'll raise a ticket. it's a good call regardless of this specific issue

👍1
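For the ticket, a minimal circuit breaker sketch, assuming a consecutive-failure threshold and a fixed cool-down. `CircuitBreaker`, `CircuitOpenError`, and the default values are hypothetical choices, not the Vertex client's design, and the class is not thread-safe as written.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers would wrap each outbound request, for example `breaker.call(send_request, payload)`, and treat `CircuitOpenError` as an immediate failure instead of queueing behind a slow dependency.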

Tom Brennan

vertex integration team just sent a follow-up — they think it may be related to a BGP route change their network team made on May 25. they're reverting the change in their staging environment to test. no ETA on fix

🤔2

Ryan Kelly

BGP route change causing intermittent latency spikes — that would fit the pattern. not load correlated, short duration, happens at irregular intervals. makes sense

Tom Brennan

monitoring through end of week. if vertex confirms the fix I'll update here and close out the investigation.

👍3

Lena Park

no further spikes today. may be settling. keeping the monitor running

Tuesday 2 June

Tom Brennan

all quiet since thursday. vertex confirmed BGP route revert deployed on their side thursday evening. p99 has been steady at ~190ms for 36 hours. closing out the investigation. root cause: third-party BGP routing change. no data loss, no payment failures

✅4

Lena Park

circuit breaker ticket is in backlog: PLAT-4421. spike alert is live in Grafana

👍2

Saturday 6 June

Ryan Kelly

weekly check complete. all integrations healthy, PD alerting confirmed active, no open incidents

✅2

Thursday 11 June

Lena Park

all systems nominal. nothing to report
