When I worked at Opera, we started a project called Opay in early 2018: a digital bank and fintech solution for the African market. At the time, our eyes were on Nigeria, Kenya, and similar markets. We had a banking license, but we knew it would be a long road to our intended destination, and we wanted to get into the market as quickly as possible. To do that, we would stand on the shoulders of giants.
The first giant was Flutterwave. Flutterwave would process our incoming and outgoing transactions: incoming debit card payments and outgoing direct bank transfers. We had to pay transaction fees, which were pricey considering we were marketing ourselves to customers as having no fees (an acquisition strategy). But we didn't mind, as long as it helped us roll out quickly. And it worked very well, until it didn't. A consistent percentage of transactions always failed for provider-specific reasons, and this was bad for us as an early-stage product, since it cost us the customers we were working so hard to acquire.
So, we did what any reasonable engineer would do. We implemented a fallback to another giant. This giant was Paystack. They also had a fee. Maybe even a higher fee at the time, but we just wanted a fallback that would work. The same thing happened; even the fallback would sometimes fail.
So we explored the next fallback, a direct integration with a bank, Zenith Bank. It worked as well, until it didn't.
Eventually, we went directly to the source and implemented an integration with NIBSS, the Nigeria Inter-Bank Settlement System. We couldn't go any further down than that.
At one point, we had four levels of fallbacks in place: four giants, each backing up the others.
Learning 1. If your core business critically depends on a third party, you had better implement fallbacks early.
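As a minimal sketch of what such a fallback chain can look like (not our actual Opay code), the function below tries each provider in order and returns the first success. The adapter functions, the `ProviderError` type, and the provider names are all hypothetical stand-ins for real integrations.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider adapter when a transfer attempt fails."""


# Hypothetical adapter shape: (provider name, function taking an amount and
# returning that provider's transaction reference).
Provider = tuple[str, Callable[[int], str]]


def transfer_with_fallbacks(amount: int, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, reference) for the first success."""
    failures: list[str] = []
    for name, send in providers:
        try:
            return name, send(amount)
        except ProviderError as exc:
            # Remember why this provider failed, then fall through to the next one.
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(failures))


# Stand-in adapters for illustration only; real ones would call each provider's API.
def failing_adapter(amount: int) -> str:
    raise ProviderError("service unavailable")


def working_adapter(amount: int) -> str:
    return f"ref-{amount}"


if __name__ == "__main__":
    chain = [("primary", failing_adapter), ("secondary", working_adapter)]
    print(transfer_with_fallbacks(500, chain))
```

One caveat with money: only fall through on failures where you are sure the first provider did not move the funds. An ambiguous failure, such as a timeout after the request was sent, needs reconciliation before a retry, or you risk paying twice.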
Unfortunately, having so many integrations and fallbacks is a lot for an early startup to worry about.
Problem 2. Choice
Having four levels of fallbacks was not a magical solution. We needed to rank and order these integrations correctly.
- Which of them should be Integration 1?
- In what order should the fallbacks be?
To do this, the transaction fees of the different providers would be important. We wanted to pay lower fees, but we also wanted a provider with lower error rates, preferably one with consistent performance and latency.
For this, we had to analyze the providers like crazy: not just the errors, but also the performance over time. We treated the requests and errors from the providers as data to be analyzed, and as a general metric to answer questions like:
- What is the expected error rate for Provider X (e.g. Flutterwave)?
- Are the errors sporadic? Or is there a trend?
- Are they silly breaking changes from the provider that could have been avoided?
- Are the performance and latency consistent, or are they sporadic?
Then we could weigh in the other factors and choose the right provider based on hard numbers. Sometimes these stats gave interesting insights. For example, a provider with a higher error rate specifically around noon on weekdays probably has its peak traffic around noon and has trouble handling it, so maybe it's safer to use them off-peak only.
Learning 2. Maintain metrics about your integrations' performance and errors, so you can make intelligent decisions about what to let go of and what to keep.
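To make the "noon on weekdays" kind of insight concrete, here is a small sketch that buckets logged attempts by provider and hour of day and computes the failure fraction per bucket. The `Attempt` record and its fields are assumptions for illustration; a real setup would read these from your request logs or metrics store.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Attempt:
    """One logged call to a provider (hypothetical record shape)."""
    provider: str
    at: datetime
    ok: bool


def error_rate_by_hour(attempts: list[Attempt]) -> dict[tuple[str, int], float]:
    """Return the failure fraction for each (provider, hour of day) bucket."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    failures: dict[tuple[str, int], int] = defaultdict(int)
    for attempt in attempts:
        key = (attempt.provider, attempt.at.hour)
        totals[key] += 1
        if not attempt.ok:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


if __name__ == "__main__":
    log = [
        Attempt("provider_x", datetime(2018, 6, 4, 12, 5), ok=False),
        Attempt("provider_x", datetime(2018, 6, 4, 12, 40), ok=True),
        Attempt("provider_x", datetime(2018, 6, 4, 9, 15), ok=True),
        Attempt("provider_x", datetime(2018, 6, 4, 9, 30), ok=True),
    ]
    for (provider, hour), rate in sorted(error_rate_by_hour(log).items()):
        print(f"{provider} @ {hour:02d}:00 -> {rate:.0%} failed")
```

Latency can be bucketed the same way (for example, a percentile per provider per hour), which gives you the numbers needed to rank providers or route around a provider's peak hours.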
Problem 3. Critical path
Early on, we depended solely on SMS OTPs for every login. That is good for security, and it is also good UX when you get the SMS immediately and the app can automatically read it and fill in the OTP.
I remember being on call one Saturday when the provider we used (one of the largest global players) was down for whatever reason. They were down for almost an entire day, and there was a lot of panic. My Engineering Manager at the time spent hours trying to get the company on the phone while we were getting tons of support calls and messages from customers who were unable to log in and access their money. (Never keep users away from their money! They won’t be nice.)
Learning 3. It’s better to limit the third parties you depend on for the critical path (Yes, even global providers). E.g. maybe send both SMS and Email OTPs? And have fallbacks?
Problem 4. SLOs. Lol. Integration partners lie too.
When there were failures, it could easily be a game of cat and mouse to find what was causing the error at the time. Was it us? Or was it them? Which of them? Then, we needed to know what kind of errors were happening. Were particular errors happening consistently over time? Was it a preventable error?
Did a provider push a “minor” API change with no notice to us, hence breaking our systems? How do we identify what exact change they made so we can quickly implement support for the change while under fire and dealing with unhappy customers?
And this experience was not unique to my team. I remember a friend complaining about being woken up at 1 a.m. by cascading errors. A dependency they relied on had changed something without notifying them. He didn't know this, so he spent time under pressure investigating the failures, only to realize that they were caused by one particular integration.
The errors stopped not long after, but by morning, when he brought it up with the company that caused the failures, their team denied that anything had happened at all. Imagine how frustrating that was, especially when you have proof that the errors were due to that integration.
What he lacked to back his claim was the exact changes that had happened. If his team had logged the responses from the integration, they might have been able to show exactly which fields had changed on the integration's side, and how that impacted them.
Learning 4. Tracking error metrics might not be enough. You need details about the different errors, and even the raw requests and responses. But that is too much data to inspect by hand, so you need a system that can diff them in some way and detect exactly what changed.
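One way to approach that diffing, sketched here under my own assumptions rather than as how any particular tool does it, is to reduce each logged response to its "shape" (field paths and value types) and compare shapes instead of raw payloads. The field names in the example payloads are made up.

```python
from typing import Any


def shape(value: Any, path: str = "") -> dict[str, str]:
    """Flatten a JSON-like value into {field path: type name}."""
    if isinstance(value, dict):
        out: dict[str, str] = {}
        for key, child in value.items():
            out.update(shape(child, f"{path}.{key}" if path else key))
        return out
    if isinstance(value, list):
        out = {}
        for child in value:
            # All list items share one path, marked with [].
            out.update(shape(child, f"{path}[]"))
        return out
    return {path: type(value).__name__}


def diff_shapes(before: Any, after: Any) -> dict[str, list[str]]:
    """Report fields that were removed, added, or changed type between two payloads."""
    old, new = shape(before), shape(after)
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "type_changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


if __name__ == "__main__":
    yesterday = {"status": "ok", "amount": 1000, "meta": {"bank_code": "057"}}
    today = {"status": "ok", "amount": "1000.00", "meta": {"bankCode": "057"}}
    print(diff_shapes(yesterday, today))
```

Storing shapes rather than full payloads keeps the data small, and a non-empty diff is concrete evidence to bring to a provider: this field was renamed, that one changed from a number to a string.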
Breakage is not ideal, but we live in an imperfect world.
As someone who has been on the serving end of building systems others rely on, I know that offering very reliable systems is difficult and involves hard trade-offs. To make a system very reliable, you first have to decide how much reliability you even mean. The industry uses the concept of the 9s to indicate the level of reliability a company is guaranteeing. I think Google offers five 9s, which is 99.999% uptime.
To have such a level of uptime, you first need to have a lot of redundancy, which costs money. A lot of money. Is it just database redundancy? What about server redundancy? Then what about regional redundancy? Etc.
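To put the 9s in concrete terms, here is a back-of-the-envelope calculation of how much downtime each level permits per year. It assumes a 365-day year and that the target is measured over the whole year; real SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(nines: int, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` by an uptime target with `nines` 9s."""
    unavailability = 10 ** -nines  # e.g. five 9s (99.999%) -> 0.00001
    return days * 24 * 60 * unavailability


if __name__ == "__main__":
    for nines in (2, 3, 4, 5):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

Five 9s leaves a little over five minutes of downtime in a whole year, while three 9s allows almost nine hours. That gap is what all the redundancy is paying for.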
So I don’t expect that systems won’t break. But as engineers who live in an imperfect world, we also want an easier life. Our customers are going to complain if our systems are down or broken, so we need to at least minimize that breakage as much as possible. More importantly, we want to be the first to know about any breakage, even before our customers, so that we are already doing something about it.
We also want to know what kind of breakage it is. Is it something outside of our control? Or is it something we could do something about to remedy the problems?
This is a topic I’m thinking a lot about, and I would like to share more about. Especially since I’m building tools in an effort to solve some of these challenges.
Problem 5. A lot of Breakages are Silent
In a project I maintained, I used to depend on a particular date/time library. It was basically sugar around popular date/time operations, and it could parse date/time strings through a predefined set of rules.
But when it came to parsing and rendering date strings, it had a preference for US-style dates (MM/DD/YYYY). So there was a conversation on GitHub, lasting months, about switching the default to the more popular format (DD/MM/YYYY).
There was a long deprecation and announcement cycle before the new default took effect. But like a lot of people, I don’t keep up with the GitHub conversations of every single library I use. So when this happened, our systems broke. And they broke silently.
Suppose you had an endpoint, /users/:userID, and you stored users' birthdays in your database. If your tests used 01/01/23, they continued to pass after the upgrade, because 1st January reads the same in both formats.
{
  "birthday": "01/01/23", // 1st January, 2023
  "name": "John Doe"
}
After the update, a user like that would experience no breakage.
But for a user whose birthday was 12th January 2023, the API call would return:
{
  "birthday": "01/12/23", // Originally 12th January, 2023, but now it means 1st December, 2023
  "name": "John Doe"
}
So now you have a silent error that no one notices. Your calculations are simply wrong until someone detects it weeks later, after a lot of damage has been done.
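The ambiguity is easy to reproduce with nothing but a standard library. The sketch below uses Python's `datetime.strptime` (not the library from the story) to parse the same strings under both conventions, and adds a small, hypothetical `is_ambiguous` helper that flags strings that are valid under both formats but mean different dates.

```python
from datetime import date, datetime

US_FORMAT = "%m/%d/%y"   # MM/DD/YY, the old default in the story
DMY_FORMAT = "%d/%m/%y"  # DD/MM/YY, the new default


def parse(text: str, fmt: str) -> date:
    return datetime.strptime(text, fmt).date()


def is_ambiguous(text: str) -> bool:
    """True when both formats accept the string but disagree on the date."""
    try:
        return parse(text, US_FORMAT) != parse(text, DMY_FORMAT)
    except ValueError:
        # Only one format can parse it, so a format swap would fail loudly.
        return False


if __name__ == "__main__":
    for text in ("01/01/23", "01/12/23", "01/13/23"):
        print(text, "ambiguous" if is_ambiguous(text) else "not ambiguous")
```

A fixture like 01/01/23 can never catch a format swap, because both readings agree. A date with a day above 12 makes the swap raise an error, and an ambiguous date such as 01/12/23 compared against the expected calendar date makes the test fail instead of passing quietly.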
Learning 5. I’m not sure, tbh. Silent errors like this are difficult to identify, and my work at apitoolkit.io is heavily focused on detecting many issues, including these.
Monitoring is a Half-Solution
Just like in our initial days at Opay, we could implement detailed monitoring. With every failure or anomaly detected, get a ping or an alert. But that often results in alert fatigue. Every little hiccup, genuine or false positive, will ping you, and soon, you’ll be ignoring these alerts or, even worse, shutting them off.
And there’s another problem. Most monitoring solutions just inform you that something is wrong. They don’t necessarily tell you what exactly is wrong. Is the error on your side? Or is it on the side of one of the many services you’re integrating with? What's more, they only detect obvious errors and can’t detect silent errors like the field changes we mentioned above.
In the last 2 years, I’ve been thinking a lot about these problems, especially because my friends are also in similar situations and fighting many of these battles daily. Solving this is a journey, and one step in that direction is via APIToolkit, a project I have been working on that already puts some of these thoughts into practice by tracking errors on your applications, building a model of your incoming and outgoing APIs, and validating all requests against that model to detect even those silent issues.
It’s basically a backend real-time quality assurance platform masquerading as an API management system, and I hope to share more about what it looks like so far. For now, you can take a look at https://apitoolkit.io. I hope to continue finding new solutions to some of these problems, and hopefully make managing the Shege a little easier for us developers.
Working in this space, I have sort of become a collector of stories from engineering teams who face this Shege, and I hope to share more and more of their stories. It sometimes surprises me how many of the same issues engineering teams tend to face.
Have you faced similar or related issues in your career, working with APIs? I would love to hear them too!