Hashing passwords at speeds of 1500 requests per second and higher

Large interactive live events can create many challenges for authentication and login services. Find out how the Yle ID team improved login performance using AWS Lambda and Rust.

The Yle ID team here at Yle (the Finnish broadcaster), which the author is part of, is responsible for creating the account services used across all Yle digital products and services. There are currently just over three million Yle IDs registered.

In this post, we explore an issue we encountered during a large live interactive television event where the number of password hashing operations caused our authentication backend to fail. We’ll start with a quick introduction to password hashing, then look at the challenges it poses for our services and how we’ve overcome them using AWS Lambda and Rust functions.

Background

First, a little about how passwords are stored and the problems that creates.

When a user registers for an account with a service, their password is not stored in clear text in the application’s database. Instead, to protect against password leaks in the event of a data breach, we compute a cryptographic hash of the password, which is then saved along with the rest of the account information. The algorithms used to calculate these hashes are designed so that, given the resulting hash, it is practically impossible to reverse the process and recover the original password. This way, even if someone gets hold of the saved account data, they won’t be able to use it to log in to the account.

Another important property of these hash algorithms is that they require a lot of resources (computing power or memory, or both). This is intentional, to hinder password brute forcing: if a database containing password hashes is compromised, an attacker can calculate hashes for different passwords until one matches a hash in the database. Finding a matching hash means the guessed password was correct. If the chosen algorithm is too fast or uses too little memory, an attacker can try many different passwords in a very short time, increasing the likelihood of finding a match. An attacker could also try to speed up the process by using lists of pre-calculated hashes of popular passwords, called rainbow tables. This type of attack is prevented by salting the password before the hashing operation, but that is beyond the scope of this post.

So, if we don’t store the user’s password, how can we log them in? It’s simple: when the user later provides their password at login, we calculate the same hash again and compare it with the one stored in the database to determine whether the given password was correct. This means we have to perform expensive hash calculations every time a user tries to log in (and also when they reset their password or when a new user registers for an account).
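
To make the flow concrete, here is a minimal sketch of the hash-then-verify pattern in Rust using the argon2 crate. The post does not name the specific algorithm or parameters we use, so the crate, algorithm, and values below are illustrative assumptions only.

```rust
use argon2::{
    password_hash::{rand_core::OsRng, PasswordHash, PasswordHasher, PasswordVerifier, SaltString},
    Argon2,
};

fn main() -> Result<(), argon2::password_hash::Error> {
    let password = b"correct horse battery staple";

    // At registration: hash the password with a random salt and store
    // only the resulting hash string, never the password itself.
    let salt = SaltString::generate(&mut OsRng);
    let hasher = Argon2::default();
    let stored_hash = hasher.hash_password(password, &salt)?.to_string();

    // At login: recompute the hash from the supplied password and
    // compare it with the stored one.
    let parsed = PasswordHash::new(&stored_hash)?;
    let login_ok = hasher.verify_password(password, &parsed).is_ok();
    assert!(login_ok);

    Ok(())
}
```

The random salt generated at registration is what defeats rainbow tables: the same password hashed with different salts produces different hashes.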

Problem

In a normal situation, hashing these passwords is not a problem, since the volume of such requests is quite small, but large events can produce huge spikes in CPU-intensive activity (login, registration, password reset). For example, during a large live event, the host might invite viewers to log into the Yle app to participate in some way (chat, vote, play a game, etc.). This shows up directly as a sharp increase in the number of requests received by our server (Figure 1).

Figure 1. Requests per second for endpoints performing password hash calculations during a large interactive live event. The graph shows clearly when viewers were invited to take part in the broadcast.

During one of these live events, a larger-than-expected spike in the request rate exhausted all available CPU in our cluster, causing the backend to grind to a halt. Compounding the problem was the fact that we had recently increased the resource requirements of our password hashing to match those recommended for modern applications.

The backend was so busy calculating all these password hashes that it couldn’t respond to other requests either. The obvious solution to this problem is to allocate even more resources to the application. However, this only works when we can anticipate a surge in traffic. What about unexpected events, such as breaking news? Of course, we could always keep a lot of spare capacity on hand in case something happens, but that would be very wasteful and expensive.

Solution

So, let’s summarize the problems with the current architecture:

  • A single operation accounts for most of the resource (CPU) usage in the application.

  • This operation can bring down the entire application.

  • The operation is mandatory and cannot be made less resource intensive without compromising security. ¹

  • The load can vary greatly and be unpredictable.

Our backend is built in Scala, and for historical reasons we used a third-party Scala implementation of the password hashing algorithm. We ran several tests comparing this implementation with Bouncy Castle and the native JDK implementation of the same algorithm, but found no significant differences in resource consumption. We also compared them with the Rust version of the algorithm and found it to be more performant than the JVM implementations. However, integrating a Rust component into our application posed its own problems, and even though the Rust version was faster, on its own it would only have pushed the problem further out. With more users and bigger events on the horizon, a faster implementation of the same order would only buy us some time; it wouldn’t solve the underlying problem.

After thinking about the problem, we concluded that we needed to isolate the hash calculation in a separate microservice. We hypothesized that using a platform that automatically scales with the request rate, such as AWS Lambda, would free us from having to worry about provisioning the right amount of resources. Calculating a hash will still take a relatively long time compared to other operations, but from the main backend’s perspective it is now just an HTTP request, during which the backend is free to serve other requests.

We created a prototype implementation of this new microservice to test our assumptions. It was a really small and simple HTTP API that took the user’s password, performed the hash calculation, and returned the resulting hash. With this prototype in hand, we ran a performance test simulating the traffic from the event where the problems occurred, using the same amount of resources that were provisioned during the event. Although the tests only used login requests, they still give a good performance estimate, since the resource usage of any other operation is negligible compared to computing a password hash.
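
The post doesn’t include the service’s code, but the overall shape of such a hashing Lambda might look roughly like the sketch below, using the lambda_http and argon2 crates. The route, request/response format, and algorithm choice here are assumptions for illustration, not the actual Yle ID API.

```rust
use argon2::{
    password_hash::{rand_core::OsRng, PasswordHasher, SaltString},
    Argon2,
};
use lambda_http::{run, service_fn, Body, Error, Request, Response};

async fn handler(event: Request) -> Result<Response<Body>, Error> {
    // Expect the password as the raw request body (illustrative only).
    let password = match event.body() {
        Body::Text(text) => text.clone(),
        _ => return Ok(Response::builder().status(400).body("expected text body".into())?),
    };

    // The expensive part: compute the hash. This is the work that moves
    // off the main backend and onto Lambda's per-request scaling.
    let salt = SaltString::generate(&mut OsRng);
    let hash = Argon2::default()
        .hash_password(password.as_bytes(), &salt)
        .map_err(|e| Error::from(e.to_string()))?
        .to_string();

    Ok(Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body(hash.into())?)
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    run(service_fn(handler)).await
}
```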

Running the test at the event’s traffic volume (1500 requests/s) did not cause problems for the new implementation. We were able to increase the rate to 3000 requests per second before we started noticing errors in the response codes. Response times were really good, with the 90th percentile hovering around 300 milliseconds (Figure 2). (Remember that the hash calculation is intentionally slow, and most of the time is spent on the algorithm itself.) Even with the overhead of the added HTTP request, the new version still delivers slightly faster response times than the old one.

Figure 2. Performance test results with the new password hashing microservice at 3000 requests per second.

Success! A 100% increase in throughput on the same infrastructure is exactly what we hoped for. The errors we saw during load testing were related to other parts of the infrastructure, not to the slowness of the password hashing. Quite often, fixing one bottleneck reveals another elsewhere, but that only shows how effective our solution is, and we’ve been successfully running even larger load tests ever since.

To prepare the implementation for production, we added some additional parameters to the microservice that let us handle future changes to the hashing algorithm (existing users may still have hashes created with older parameters). We also made the main application fall back to calculating the hash itself in case the microservice is unavailable for some reason, such as network problems. An additional benefit of this change is that we can now reduce the CPU allocation for the main application, which, even with the cost of the Lambda function, reduces our overall operating costs for the service.
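
Our fallback logic lives in the Scala backend, but the idea can be sketched as follows (kept in Rust for consistency with the other examples). The endpoint URL, timeout, and helper names are hypothetical.

```rust
// Hypothetical sketch of the fallback behaviour: try the hashing
// microservice first, and compute the hash locally if the call fails.
use std::time::Duration;

async fn hash_password(password: &str) -> Result<String, Box<dyn std::error::Error>> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(2))
        .build()?;

    // Preferred path: delegate the expensive hash to the microservice.
    match client
        .post("https://hashing.example.internal/hash") // hypothetical endpoint
        .body(password.to_owned())
        .send()
        .await
    {
        Ok(resp) if resp.status().is_success() => Ok(resp.text().await?),
        // Fallback path: on network problems or a non-2xx response, hash
        // locally so logins keep working even if the microservice is down.
        _ => hash_locally(password),
    }
}

fn hash_locally(password: &str) -> Result<String, Box<dyn std::error::Error>> {
    use argon2::{
        password_hash::{rand_core::OsRng, PasswordHasher, SaltString},
        Argon2,
    };
    let salt = SaltString::generate(&mut OsRng);
    Ok(Argon2::default()
        .hash_password(password.as_bytes(), &salt)
        .map_err(|e| e.to_string())?
        .to_string())
}
```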


¹

At the time the problems arose, password hashing was a necessary evil. We have since implemented the ability to log in without a password using a one-time code sent via email, which does not create the same problems.