The Great API Gateway Migration

The story of how we migrated one of the key components of our system to an open-source solution

Hai Le Gia
Technology @ Funding Societies | Modalku

--

Background

We did not always have a standalone API Gateway.

The first version of the application that we launched in 2015 was a .NET monolith that served all the borrowers and investors on our platform. Authentication, authorization, and business logic resided in the same place. This made sense at that point because we had a really small team of engineers and this set-up enabled us to be nimble and move fast.

However, over the course of the subsequent year, more engineers joined our team in both Singapore and Indonesia. By mid-2016, it became clear to us, given the size of the team and the product roadmap at hand, that the monolith was quickly becoming a serious bottleneck. It was getting hard to orchestrate the deployment of new features, especially since the team was distributed and coordination could only happen asynchronously over Slack.

To tackle this problem, we decided to move to a service-oriented architecture. The core components were rewritten as NodeJS web services, and many other ancillary services were written in either Python or Java. It took us around six months to migrate the whole system, and in early 2017 we launched our micro-services-backed production system. This also marked the debut of our own in-house API Gateway, built on top of the Spring framework.

High-level view of our architecture with our in-house API Gateway

With this new architecture, our engineers could now work more efficiently. We had smaller teams, each responsible for one or more of the new micro-services. They could develop and deploy new features themselves whenever they wanted, as long as they kept the API contracts intact. At this point, our public API set was still small. We were mostly focusing on rewriting and optimising our business logic. There were not many requests to add new endpoints to the API Gateway or to update the configuration of existing ones. Therefore, we felt that we were in good shape.

By the end of 2018, however, our API Gateway had become the new bottleneck. Our core systems were stable after months of work, but we now had a backlog of feature requests from the Business & Product teams to take on. Many of these features, which were targeted at either external users or internal staff members, required the addition of new APIs to the Gateway. It was at this juncture that the limitations of the custom API Gateway became apparent.

I must clarify first that I don’t think there is anything wrong with the API Gateway pattern. The problem was with the way we had built our in-house API Gateway. It was not designed to be extensible or configurable. There was no Admin API available for either run-time operations or inspection. To add a new API endpoint or to enable a new behaviour (e.g., rate-limiting) for an existing endpoint, we had to add some glue code, then rebuild and re-deploy the service, which was a slow and painful process.

The code snippet above illustrates how a new route was exposed on the Gateway
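
As a rough, hypothetical illustration of that kind of glue code (not our actual snippet; the class, route, and downstream service names below are made up), exposing a route on a Spring-based gateway looks something like this:

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

// Hypothetical example: proxying a new public endpoint to a downstream micro-service.
@RestController
@RequestMapping("/api/loans")
public class LoanRoutes {

    // In reality the downstream address would come from configuration or service discovery.
    private static final String LOAN_SERVICE_URL = "http://loan-service.internal:8080";

    private final RestTemplate restTemplate = new RestTemplate();

    // Every new endpoint meant another handler like this, followed by a rebuild
    // and redeployment of the whole Gateway.
    @GetMapping("/{loanId}")
    public ResponseEntity<String> getLoan(@PathVariable("loanId") String loanId) {
        return restTemplate.getForEntity(LOAN_SERVICE_URL + "/loans/{id}", String.class, loanId);
    }
}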

While the design had seemed reasonable in early 2017, we realised that we had outgrown it when we observed the following limitations:

It was not easy to develop on

Even though we had a micro-services architecture, the custom Java Gateway project was still shared across the team. If a feature required new API endpoints to be added to the Gateway, developers had to branch off master, add the code to expose the new endpoints, and open a pull request. It was not a lot of work, but it was still cumbersome. There were also times when we had to cherry-pick commits or deal with merge conflicts.

It was not easy to test with

We had a single QA environment in which only one version of the API Gateway could run. Since each feature that exposed new API endpoints required its own code branch, and hence its own build of the Gateway, the QA engineers were forced to test features sequentially, which was not ideal.

It was not easy to inspect

Our Product Security team wanted to run automated tests periodically against the Gateway configuration of all Internet-facing API endpoints to identify misconfigured access control/security rules. While all these endpoints were consolidated in the custom API Gateway, the registry could not easily be fetched and analysed programmatically. We had to write a program that used Java Reflection to extract the information that we needed, but this approach had its limitations.
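
To give a flavour of that approach, the sketch below is a heavily simplified, hypothetical version of such a reflection-based audit; the controller class names are made up, and the real program had to handle far more annotation combinations and security metadata:

import java.lang.reflect.Method;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;

public class RouteAudit {

    public static void main(String[] args) throws Exception {
        // Hypothetical controller classes exposing routes on the custom Gateway.
        String[] controllers = {
            "com.example.gateway.routes.BorrowerRoutes",
            "com.example.gateway.routes.InvestorRoutes"
        };

        for (String className : controllers) {
            Class<?> clazz = Class.forName(className);

            // Base path declared at class level, if any.
            RequestMapping classMapping = clazz.getAnnotation(RequestMapping.class);
            String basePath = (classMapping != null && classMapping.value().length > 0)
                    ? classMapping.value()[0] : "";

            // Collect the HTTP method and path of every handler method.
            for (Method m : clazz.getDeclaredMethods()) {
                GetMapping get = m.getAnnotation(GetMapping.class);
                PostMapping post = m.getAnnotation(PostMapping.class);
                if (get != null) {
                    System.out.println("GET  " + basePath + join(get.value()));
                } else if (post != null) {
                    System.out.println("POST " + basePath + join(post.value()));
                }
            }
        }
    }

    private static String join(String[] paths) {
        return paths.length > 0 ? String.join(",", paths) : "";
    }
}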

It was not easy to observe

The custom Gateway produced a lot of application logs, but they were not available in a structured format (e.g., JSON). Hence, our DevOps team had to spend a lot of time and effort extracting the data needed to build dashboards and set up the required alerts. The resulting set-up was brittle and error-prone, as there were many edge cases that weren’t accounted for.

With all these challenges in mind, we decided that it was time to upgrade our standalone API Gateway. We could either rewrite it or look for a suitable solution externally (by adopting a Proudly Found Elsewhere mindset). As the API Gateway is an important component in our architecture, we knew that replacing it would require close coordination across teams. We were also aware that we had already spent a lot of effort building our custom API Gateway, and it was one of the most stable (albeit inflexible) components of our system. However, we all agreed that moving to an extensible, community-maintained solution could boost our productivity and allow us to focus more on our core competencies. Hence, we recruited senior team members from the different squads and kickstarted the Open Source Gateway project internally, codenamed OHGEE!.

Market Research

The first step was to enumerate all the requirements that we wanted to have in the new solution. Here are some of them:

  • Support for our existing authentication mechanism and previously issued access tokens.
  • Support for delegated authentication (e.g., Google), which would be required for the Gateway used by our internal applications.
  • Extensibility through plugins that we could write ourselves.
  • Support for structured logging (e.g., JSON) for requests & responses.
  • Configurability via an Admin API, a configuration file (declarative), or the database.
  • Strong community support.

At that time (late 2018), we found a number of good open-source API Gateway projects; Kong and KrakenD were among them.

We evaluated each of these solutions against our requirements above and came up with the following matrix:

Feature Matrix (numbers collected in November 2018)

At this point, Kong clearly stood out. While we were impressed by KrakenD, it was still too young a project for us to adopt; we wanted to give it more time to mature.

Meanwhile, we decided to set up a Kong cluster to try out its features and evaluate its performance. We agreed that we would consider moving to Kong only if it passed our performance evaluation.
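
For reference, registering a test service and route on such a cluster through Kong’s Admin API can be sketched as follows; the Admin address and the upstream URL are placeholders rather than our actual hosts:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterTestRoute {

    // Placeholder Admin API address; Kong listens on port 8001 by default.
    private static final String KONG_ADMIN = "http://localhost:8001";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Register the nginx resource server as a Kong service (placeholder upstream).
        post(client, KONG_ADMIN + "/services",
                "{\"name\": \"static-files\", \"url\": \"http://10.0.0.10:80\"}");

        // 2. Expose it on a route so that wrk can reach it through Kong.
        post(client, KONG_ADMIN + "/services/static-files/routes",
                "{\"paths\": [\"/test\"]}");
    }

    private static void post(HttpClient client, String url, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}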

Performance Evaluation

Our goal was to measure the overhead that Kong would add to the response latency of requests to the backend micro-services, and to gauge the median throughput that we could expect to achieve.

Components

  • Resource server: nginx web server configured to serve a static file
  • Kong server (located on the same virtual subnet as the resource server)
  • Load test client: wrk, accessing the servers over the public Internet

Test Scenarios

  • Base scenario: Client calls the resource server directly
  • Test scenario 1: Client calls a public API (no authorization) of the resource server via Kong
  • Test scenario 2: Client calls a private API (with authorization) of the resource server via Kong

Results

# Base Case
$ wrk -c 400 -d60s -t200 --latency .../test.json
...
200 threads and 400 connections
...
Latency Distribution
...
90% 583.05ms
99% 1.38s

...
Requests/sec: 2894.53
...

# Scenario 1
$ wrk -c 400 -d60s -t200 --latency https://*****/test.json
...
200 threads and 400 connections
...
Latency Distribution
...
90% 531.99ms
99% 1.29s

...
Requests/sec: 2793.61

# Scenario 2
$ wrk -c 400 -d60s -t200 --latency \
-H 'Authorization: Bearer ************' \
https://*****/test_secured.json
...
200 threads and 400 connections
...
Latency Distribution
...
90% 598.01ms
99% 1.36s

...
Requests/sec: 2590.24

The latency distribution looked good to us at both the 90th and 99th percentiles for the three scenarios tested; the figures through Kong were comparable to the base case, and throughput dropped by only about 3.5% for the public API and 10.5% for the secured one (from ~2,895 to ~2,794 and ~2,590 requests/sec respectively). We decided to proceed with the migration to Kong.

Deployment

For the deployment, we had some non-negotiable requirements:

  • Canary Release: Ability to selectively roll out to some users or cohorts, and to stagger the roll-out as a percentage of requests received (e.g., 20% of all requests going to Kong). This threshold had to be configurable.
  • Easy Rollback: Ability to revert quickly in case of any unforeseen problems.
  • No Downtime: Users should ideally not be aware of this change happening behind the scenes.

To achieve these, we decided to go with the following architecture:

Deployment architecture
  1. We used two Kong clusters: one (Kong 1) with just a custom routing plugin that could help us perform the canary roll-out, and the other (Kong 2) as our primary cluster with all our routes registered and configured with custom plugins. Kong 1 would route a given request to either the custom Java Gateway or Kong 2 depending on the configuration of its routing plugin. We could update its behaviour dynamically using Kong’s nifty plugin configuration API (see the sketch after this list).
  2. The first step was to route all requests to Kong 1 instead of our custom Java Gateway. There was no downtime involved in making this switch. We simply updated the DNS record on Cloudflare to point to the new load balancer that now proxied to Kong 1 instead of the old load balancer that had been proxying to the Java Gateway.
  3. We started our canary roll-out by only enabling Kong 2 for our internal team members. Issues that were highlighted on Slack were investigated and resolved. We slowly increased the percentage of requests going to Kong 2 by 20% every 2-3 days. While doing this, we kept a close eye on Customer Support tickets coming in to ensure that any issues that were being faced by our customers were not inadvertently caused by the ongoing migration. After two weeks, all the requests were routed to Kong 2. We still kept the set-up running for another two weeks until we were confident that Kong 2 was working well. Then, we executed another switch on Cloudflare by pointing the DNS record to the load balancer that proxied to Kong 2 directly. Finally, we removed all the extraneous components in the migration path and killed the custom Java gateway.
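
As an illustration of step 3, ramping the canary up via the Admin API might look like the sketch below. The plugin instance ID and the percentage field are hypothetical, since the routing plugin and its configuration schema were custom to us, but PATCH /plugins/{id} is Kong’s standard Admin API call for updating a plugin’s configuration at run time:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UpdateCanaryPercentage {

    public static void main(String[] args) throws Exception {
        String kongAdmin = "http://localhost:8001";         // placeholder Admin API address of Kong 1
        String pluginId = "<canary-routing-plugin-id>";     // hypothetical ID of the custom routing plugin
        String body = "{\"config\": {\"percentage\": 40}}"; // hypothetical field: send 40% of traffic to Kong 2

        // PATCH /plugins/{id} updates a plugin's configuration without any restart.
        HttpRequest request = HttpRequest.newBuilder(URI.create(kongAdmin + "/plugins/" + pluginId))
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}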

Key Takeaways

  1. It is almost always a bad idea to accept the first solution that you come across. While it may be tempting to do so in the name of a ‘Get things done’ mindset, it is best not to act on that impulse immediately. Instead, list all the possible solution candidates; set the evaluation criteria; evaluate; and then pick the best solution.
  2. Planning is key to the success of a migration. ‘Failing to plan is planning to fail’, as they say.
  3. Know that things can and will go wrong during migrations. Do canary roll-outs for your critical components to minimise risk.

Our methodology to solve problems

And that concludes the story of how we planned and executed our migration to Kong. Hope you found this useful.

I would like to use this opportunity to thank my colleagues: Quang, Yubo, and Nikolay for working with me during the migration and for helping with the ongoing maintenance of the Kong cluster and our custom plugins.

Giving Back

We are also proud to have been able to contribute back to the Kong community in a small way: Nikolay found an issue with the way the multipart plugin processes requests with multiple files. He went on to address this in a pull request, which was merged and will likely be released along with Kong 2.0.5.

Do check out Nikolay’s post on how we set up a seamless workflow for our engineers to control the registration of Kong routes.
