Square status - Australia

Performance Issues: Payments
Incident Report for Square AU
Postmortem

Incident Summary

Beginning at 12:32 AEDT on 20/11/2021, Square payment processing in Australia experienced a two-hour period of elevated declines on Visa-, Eftpos- and Mastercard-branded transactions.

A database crash at an external vendor aggravated existing capacity issues within Square’s systems. Payment processing slowed, and some transactions failed to complete within an acceptable time frame. A second, similar incident occurred two weeks later (04/12/2021), while investigations into the first incident were still ongoing. The second incident was triggered by a hardware failure within one of Square’s data centres.

We have since updated our systems to improve Square’s processing capacity and resilience to external latency.

Timeline

20/11/2021 (AEDT)

12:32 - Payments begin timing out from heightened latency.
12:37 - Engineering is notified of the incident through automated alerting.
12:38 - Engineering begins assessing the situation and relaying information to Square’s customer support team.
13:01 - Engineering executes a remediation effort that lowers latency.
13:05 - Payments traffic largely returns to being healthy, with a small trickle of timeouts remaining.
14:40 - Payment timeouts return to healthy levels.

04/12/2021 (AEDT)

11:20 - Payments begin timing out from heightened latency.
11:22 - Engineering is notified of the incident through automated alerting.
11:24 - Engineering begins assessing the situation and relaying information to Square’s customer support team.
11:35 - Engineering executes a remediation effort that lowers latency.
11:40 - Payment timeouts return to healthy levels.
11:50 - Payments again begin timing out from heightened latency.
12:20 - Payments traffic largely returns to being healthy, with a small trickle of timeouts remaining.
12:30 - Payment timeouts return to healthy levels.

Analysis

While the first incident was triggered by external problems, it’s clear that capacity issues on Square’s side were a major contributor to it, and were the sole factor behind the second outage. Outlined below are the two key problems we’ve since identified and the solutions we’ve implemented for each:

Problem 1: Application Capacity
As transactions started taking longer to process, Square hit the limit on the number of active transactions that the application layer could handle. Otherwise healthy transactions were delayed and reached the external vendor with insufficient remaining time.

Solution 1
We’ve permanently increased the application-level processing capacity by 10x, allowing more transactions to be handled simultaneously.
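
As a rough illustration of why slower transactions exhaust a fixed concurrency limit: by Little's law, the average number of transactions in progress equals the arrival rate multiplied by the mean processing time. The sketch below uses hypothetical figures (the arrival rate, latencies, limits and deadline are assumptions, not Square's actual numbers) to show how a rise in latency pushes the in-flight count past a limit, and how time spent waiting for a free slot eats into the time left for the external vendor.

```python
# Hypothetical figures for illustration only; none of them come from Square.
ARRIVAL_RATE_PER_S = 400      # transactions arriving per second
OLD_LIMIT = 200               # assumed concurrent-transaction limit before the fix
NEW_LIMIT = OLD_LIMIT * 10    # a tenfold increase, as in Solution 1


def in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's law: average number of transactions in progress at once."""
    return arrival_rate_per_s * mean_latency_s


def time_left_for_vendor(deadline_s: float, queue_wait_s: float) -> float:
    """Time remaining in a transaction's budget after waiting for a free slot."""
    return max(0.0, deadline_s - queue_wait_s)


normal = in_flight(ARRIVAL_RATE_PER_S, 0.25)    # 100.0, under the old limit
degraded = in_flight(ARRIVAL_RATE_PER_S, 1.0)   # 400.0, over the old limit

print(normal <= OLD_LIMIT)     # True
print(degraded <= OLD_LIMIT)   # False
print(degraded <= NEW_LIMIT)   # True
print(time_left_for_vendor(deadline_s=5.0, queue_wait_s=4.0))  # 1.0
```

With these assumed numbers, a fourfold rise in latency quadruples the number of transactions in progress, which is why a larger limit leaves headroom for the same slowdown.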

Problem 2: HSM Capacity
The protocol we use to communicate with the external vendor requires generating a Message Authentication Code (MAC) for each transaction, a process that involves a Hardware Security Module (HSM). Square’s HSMs have processing limits that we reached during the incident.
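
For readers unfamiliar with MACs, the following is a minimal sketch of generating and checking one per message. It is not the vendor protocol: this report does not name the MAC algorithm, so HMAC-SHA256 from Python's standard library is assumed, and the key is held in software purely for illustration. With an HSM, the key stays inside the device and every MAC is a call to it, which is why HSM throughput limits the transaction rate.

```python
import hashlib
import hmac


def compute_mac(key: bytes, message: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for the message (assumed algorithm)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_mac(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(compute_mac(key, message), tag)


# Hypothetical key and message, for illustration only.
key = b"demo-key-not-a-real-secret"
message = b"amount=1000;currency=AUD"
tag = compute_mac(key, message)

print(verify_mac(key, message, tag))                       # True
print(verify_mac(key, b"amount=9999;currency=AUD", tag))   # False
```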

Solution 2a
We rebalanced traffic through our HSMs, directing more traffic to the region with higher capacity.
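
One simple way to picture this kind of rebalancing is to split traffic in proportion to each region's capacity rather than evenly. The sketch below is an assumption about the general approach, not a description of Square's routing; the region names, capacities and load are hypothetical.

```python
def traffic_shares(capacity_by_region: dict[str, float]) -> dict[str, float]:
    """Split traffic in proportion to each region's capacity."""
    total = sum(capacity_by_region.values())
    return {region: cap / total for region, cap in capacity_by_region.items()}


def utilisation(load_per_s: float, share: float, capacity_per_s: float) -> float:
    """Fraction of a region's capacity used for a given share of the load."""
    return load_per_s * share / capacity_per_s


# Hypothetical capacities in MAC operations per second.
capacities = {"region-a": 300.0, "region-b": 100.0}
shares = traffic_shares(capacities)

print(shares)                                  # {'region-a': 0.75, 'region-b': 0.25}
print(utilisation(200.0, 0.5, 100.0))          # 1.0: an even split saturates region-b
print(utilisation(200.0, shares["region-b"], 100.0))  # 0.5 after rebalancing
```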

Solution 2b
We eliminated an unnecessary call to the HSM that occurred on each transaction, which was effectively doubling the load.
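
The effect of a redundant per-transaction call can be shown with a counter. This report does not say what the unnecessary call was, so the sketch below assumes a hypothetical case in which the same MAC is computed twice and the first result is discarded; the `CountingHsm` class is a software stand-in, not a real HSM interface.

```python
import hashlib
import hmac


class CountingHsm:
    """Software stand-in for an HSM that counts how many calls it receives."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self.calls = 0

    def mac(self, message: bytes) -> bytes:
        self.calls += 1
        return hmac.new(self._key, message, hashlib.sha256).digest()


def send_with_redundant_call(hsm: CountingHsm, message: bytes) -> bytes:
    hsm.mac(message)           # hypothetical redundant call; result is unused
    return hsm.mac(message)


def send_once(hsm: CountingHsm, message: bytes) -> bytes:
    return hsm.mac(message)


before = CountingHsm(b"demo-key")
after = CountingHsm(b"demo-key")
for i in range(100):
    payload = f"txn-{i}".encode()
    send_with_redundant_call(before, payload)
    send_once(after, payload)

print(before.calls, after.calls)   # 200 100
```

Both paths produce the same MAC for a given message, so removing the extra call halves the HSM load in this sketch without changing the output.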

Posted Feb 08, 2022 - 16:20 AEDT

Resolved
The issues surrounding payments with our upstream banking partners have been resolved.

Thank you for your patience as we worked to get this issue resolved.
Posted Nov 20, 2021 - 13:28 AEDT

Investigating
Our teams have confirmed reports of payments declining for some Sellers, related to issues with upstream banking partners.
Our Engineering team is aware and is working with our partners to implement a fix. We will post more updates as we receive them.
Posted Nov 20, 2021 - 13:16 AEDT
This incident affected: Payments.