Getting Every Microsecond Out of uWSGI


In recent articles, I covered performance tuning both HAProxy and NGINX.

Today’s article will be similar, but we’re going to go further down the stack and explore tuning a Python application running via uWSGI.

What Is uWSGI

In order to deploy a web application written in Python, you would typically need two supporting components.

The first is a traditional web server such as NGINX to perform basic web server tasks such as caching, serving static content, and handling inbound connections.

The second is an application server such as uWSGI.

In this context, an application server is a service that acts as a middleware between the application and the traditional web server. The role of an application server typically includes starting the application, managing the application, as well as handling incoming connections to the application itself.

With a web-based application, this means accepting HTTP requests from the web server and routing those requests to the underlying application.

uWSGI is an application server commonly used for Python applications. However, uWSGI supports more than just Python; it supports many other types of applications, such as ones written in Ruby, Perl, PHP, or even Go. Even with all of these other options, uWSGI is mostly known for its use with Python applications, partly because Python was the first supported language for uWSGI.

Another thing uWSGI is known for is being performant; today, we’ll explore how to make it even more so by adjusting some of its many configuration options to increase throughput for a simple Python web application.

Our Simple REST API

In order to properly tune uWSGI, we first must understand the application we are tuning. In this article, that will be a simple REST API designed to return a Fibonacci sequence to those who perform an HTTP GET request.

The application itself is written in Python using the Flask web framework. This application is extremely small, and meant as a quick and dirty sample for our tuning exercise.

Let’s take a look at how it works before moving into tuning uWSGI.

app.py:

''' Quick Fibonacci API '''

from flask import Flask
import json
import fib

app = Flask(__name__)

@app.route("/<number>", methods=['GET'])
def get_fib(number):
    ''' Return Fibonacci JSON '''
    return json.dumps(fib.get(int(number))), 200

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080)

This application consists of two files. The first is app.py, which is the main web application that handles accepting HTTP GET requests and determines what to do with them.

In the above code, we can see that app.py is calling the fib library to perform the actual Fibonacci calculations. This is the second file of our application. Let’s take a look at this library to get a quick understanding of how it works.

fib.py:

''' Fibonacci calculator '''

def get(number):
    ''' Generate fib sequence until specified number is reached or exceeded '''
    # Seed the sequence with 0 and 1
    sequence = [0, 1]
    while sequence[-1] < number:
        sequence.append(sequence[-2] + sequence[-1])
    return sequence

From the above code, we can see that this function takes a single argument, number, and generates a Fibonacci sequence that stops at the first value that reaches or exceeds it. As previously mentioned, this application is a pretty simple REST API that does some basic calculations based on user input and returns the result.
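To make the stopping rule concrete, here is a small standalone sketch that repeats the function from fib.py and calls it directly, outside of Flask. The sample inputs are chosen only for illustration.

```python
def get(number):
    """Generate the Fibonacci sequence until `number` is reached or exceeded."""
    # Seed the sequence with 0 and 1
    sequence = [0, 1]
    while sequence[-1] < number:
        sequence.append(sequence[-2] + sequence[-1])
    return sequence


# The loop runs until the last value is no longer below the input,
# so the final element can be larger than the number requested.
print(get(9000)[-3:])  # [4181, 6765, 10946]
print(get(8)[-3:])     # [3, 5, 8]
print(get(1))          # [0, 1]
```

Note that the final value can overshoot the input: a request for 9000 ends with 10946, the first Fibonacci number that is not below 9000.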

With the application now in mind, let’s go ahead and start our tuning exercise.

Setting Up uWSGI

As with all performance-tuning exercises, it’s best to first establish a baseline performance measurement. For this article, we will be using a bare-bones setup of uWSGI as our baseline. To get started, let’s go ahead and set up that environment now.

Installing Python’s package manager

Since we are starting from scratch, we’ll need to install several packages. We’ll do this with a combination of pip, the Python package manager, and Apt, the system package manager for Ubuntu.

In order to install pip, we will need to install the python-pip system package. We can do so with the apt-get command.

# apt-get install python-pip

With pip installed, let’s go ahead and start installing our other dependencies.

Installing Flask and uWSGI

To support our minimal application, we only need to install two packages with pip: flask (the web framework we use) and uwsgi. To install these packages, we can simply call the pip command with the install option.

# pip install flask uwsgi

At this point, we have finished installing everything we need for a bare-bones application. Our next step is to configure uWSGI to launch our application.

Bare-bones uWSGI configuration

uWSGI has many configuration parameters. For our baseline tests, we will first set up a very basic uWSGI configuration. We’ll do this by adding the following to a new uwsgi.ini file:

[uwsgi]
http = :80
chdir = /root/fib
wsgi-file = app.py
callable = app

The above is essentially just enough configuration to start our web application and nothing more. Before we move into performance testing, let’s first take a second to understand what the above options mean and how they change uWSGI behaviors.

http – HTTP bind address

The first parameter to explore is the http option. This option is used to tell uWSGI which IP and port to bind for incoming HTTP connections. In the example above, we gave the value of :80; this means listen on all IPs for connections to port 80.

The http option tells uWSGI one more thing: that this application is a web application and will be receiving requests via HTTP methods. uWSGI also supports non-HTTP-based applications by replacing the http option with options such as socket, ssl-socket, and raw-socket.

chdir – Change running directory

The second parameter is the chdir option which tells uWSGI to change its current directory to /root/fib before launching the application. This option may not be required for all applications but is extremely useful if your application must run from a specified directory.

wsgi-file – Application executable

The wsgi-file option is used to specify the Python file that contains the application uWSGI should load. In our case, this is the app.py file.

callable – Internal application object

Flask-based applications have an internal application object used to start the running web application. For our application, it is the app object. When running a Flask application within uWSGI, it’s necessary to provide this object name to the callable parameter, as uWSGI will use this object to start the application.
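For readers curious what a callable looks like from uWSGI’s side, the following is a minimal sketch of a plain WSGI callable that does roughly what our Flask app does, with no framework involved. Flask’s app object implements this same WSGI interface. The names fib_sequence and call_app are made up for this illustration and are not part of the article’s application.

```python
import json
from wsgiref.util import setup_testing_defaults


def fib_sequence(number):
    """Same logic as fib.get, repeated here to keep the sketch self-contained."""
    sequence = [0, 1]
    while sequence[-1] < number:
        sequence.append(sequence[-2] + sequence[-1])
    return sequence


def app(environ, start_response):
    """A WSGI callable: receives the request environ, returns the body."""
    path = environ.get("PATH_INFO", "/").lstrip("/")
    try:
        number = int(path)
    except ValueError:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"expected an integer in the path"]
    body = json.dumps(fib_sequence(number)).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(path):
    """Hypothetical helper: invoke the callable by hand, as a server would."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app("/9000")
print(status, body.decode("utf-8"))
```

The point of the sketch is only that the callable is an ordinary Python object that the server invokes once per request. uWSGI needs its name in order to find it inside the file given by wsgi-file.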

With our basic configuration defined, let’s test whether or not we are able to start our application.

Starting our web application

In order to start our application, we can simply execute the uwsgi command followed by the configuration file we just created, uwsgi.ini.

# uwsgi ./uwsgi.ini

With the above executing successfully, we should now have a running application. Let’s go ahead and test making an HTTP request to the application using the following curl command:

$ curl http://example.com/9000
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946]

In the above example, we can see the output of the curl command is a JSON list of numbers in a Fibonacci sequence. From this result, we can see that our application is running and responding to HTTP requests appropriately.


Measuring Baseline Performance

With our application up and running, we can now go ahead and run our performance test case to measure the application’s base performance.

# ab -c 500 -n 5000 -s 90 http://example.com/9000
Requests per second:    347.28 [#/sec] (mean)
Time per request:       1439.748 [ms] (mean)
Time per request:       2.879 [ms] (mean, across all concurrent requests)

In the above, we once again used the ab command to send multiple web requests to our web application. Specifically, the above command sends 5000 HTTP GET requests to our web application, keeping 500 requests in flight at a time (the -c option), with a 90-second timeout per request (the -s option). The results of this test show that ab was able to complete a little over 347 HTTP requests per second.
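The three figures ab reports are related to one another, which is worth keeping in mind when comparing runs. The sketch below recomputes them from the request count, the concurrency level, and the total elapsed time. The elapsed time is not shown in the excerpt above, so it is inferred here from the reported rate; treat that as an assumption.

```python
def ab_summary(total_requests, concurrency, elapsed_seconds):
    """Recompute ab's three headline figures from the raw totals."""
    requests_per_second = total_requests / elapsed_seconds
    # Mean time per request across all concurrent requests, in milliseconds
    ms_across_all = elapsed_seconds * 1000 / total_requests
    # Mean time per request as seen by one of the concurrent clients
    ms_per_request = ms_across_all * concurrency
    return requests_per_second, ms_per_request, ms_across_all


# Assumption: elapsed time inferred from 5000 requests at 347.28 req/s
elapsed = 5000 / 347.28
rps, per_request, across_all = ab_summary(5000, 500, elapsed)
print(f"{rps:.2f} req/s, {per_request:.1f} ms, {across_all:.3f} ms")
```

In other words, the larger “Time per request” value is the smaller one multiplied by the concurrency level, so it grows with -c even when the server itself is no slower.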

For a basic out-of-the-box configuration, this level of performance is pretty decent. We can, however, achieve better with just a little bit of tweaking.

Multithreading

One of the first things we can adjust is the number of processes that uWSGI is running. Much like our earlier exercise with HAProxy, the default configuration of uWSGI starts only one instance of our web application.

With our current application, this basically means each HTTP request must be handled by a single process. If we distribute this across multiple processes, we may see a performance gain.

Luckily, we can do just that by adding the processes option to the uwsgi.ini file.

processes = 4

The above code will tell uWSGI to start four instances of our web application, but this alone isn’t the only thing we can do to increase our possible throughput.

While tuning HAProxy, I talked a bit about CPU Affinity. By default, uWSGI processes have the same CPU Affinity as the master process. What this means is that even though we will now have four instances of our application, all four processes are using the same CPU.

If our system has more than one CPU available, we are neglecting to leverage all of our processing capabilities. Once again we can check the number of available CPUs by executing the lshw command as shown below:

# lshw -short -class cpu
H/W path      Device  Class      Description
============================================
/0/401                processor  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
/0/402                processor  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz

From the output above, our test system has two CPUs available. This means even with four processes, we are only using about half of our processing capability. We can fix this by adding two more uWSGI options, threads and enable-threads, into the uwsgi.ini configuration file.

processes = 4
threads = 2
enable-threads = True

The threads option is used to tell uWSGI to start our application in prethreaded mode. That essentially means each process runs the application across multiple threads, so our four processes with two threads each can work on up to eight requests at once.

This also has the effect of distributing the CPU Affinity across both of our available CPUs.

The enable-threads option is used to enable threading within uWSGI. This option is required whether you use uWSGI to create threads or you use threading within the application itself. If you have a multithreaded application and performance is not what you expect, it’s a good idea to make sure enable-threads is set to True.
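Before picking values for processes and threads, it can help to check how many CPUs a process is actually allowed to run on. The following is a minimal sketch using only the standard library; os.sched_getaffinity is available on Linux but not on every platform, so the code falls back to os.cpu_count() when it is missing.

```python
import os


def usable_cpus():
    """Return the number of CPUs this process may be scheduled on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPU ids the current process (pid 0 = self) may use
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity
    return os.cpu_count() or 1


cpus = usable_cpus()
print(f"usable CPUs: {cpus}")
```

The result is only a starting point for experimentation. As the rest of this article shows, the best worker count depends on the application and has to be measured.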

Retesting for performance changes

With these three options now set, let’s go ahead and restart our uWSGI processes and rerun the same ab test we ran earlier.

# ab -c 500 -n 5000 -s 90 http://example.com/9000
Requests per second:    1068.63 [#/sec] (mean)
Time per request:       467.888 [ms] (mean)
Time per request:       0.936 [ms] (mean, across all concurrent requests)

The results of this test are quite a bit different from the original baseline. In the above test, we can see that our Requests per second is now 1068. This is roughly a 208% improvement from simply enabling multiple threads and processes.

As we have seen in previous tuning exercises, adding multiple workers can drastically improve performance.

Disable Logging

While the most common option, multithreading is not the only performance tuning option available for uWSGI. Another trick we have available is to disable logging.

While it might not be immediately obvious, logging levels often have a drastic effect on the overall performance of an application. Let’s see how much of an impact this change has on our performance before we dig into why and how it improves performance.

disable-logging = True

In order to disable logging within uWSGI, we can simply add the disable-logging option into the uwsgi.ini configuration file as shown above.

While this option may sound like it disables all logging, in reality uWSGI will still provide some logging output. However, the number of log messages is drastically decreased by only showing critical events.

Let’s go ahead and see what the impact is by restarting uWSGI and rerunning our test.

# ab -c 500 -n 5000 -s 90 http://example.com/9000
Requests per second:    1483.35 [#/sec] (mean)
Time per request:       337.076 [ms] (mean)
Time per request:       0.674 [ms] (mean, across all concurrent requests)

From the above example, we can see that we are now able to send 1483 requests per second. This is an improvement of over 400 requests per second; quite an increase for such a small change.

By default, uWSGI will log each and every HTTP request to the system console. This activity not only takes resources to present the log message to the screen, but also within the code there is logic performing the logging and formatting of the log message. By disabling this, we are able to avoid these activities and dedicate those same resources to performing our application tasks.
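To get a feel for why a log line per request adds up, here is a rough, self-contained sketch that times a loop of simulated requests with and without writing a formatted line for each one. This is not uWSGI’s logger, the log format is made up, and the absolute timings will vary from machine to machine.

```python
import io
import time


def handle_requests(count, log_stream=None):
    """Simulate `count` requests, optionally writing one log line per request."""
    started = time.perf_counter()
    for request_id in range(count):
        result = sum(range(50))  # stand-in for the application's real work
        if log_stream is not None:
            # Formatting and writing the line is the per-request overhead
            log_stream.write(f"GET /9000 => request {request_id}, result {result}\n")
    return time.perf_counter() - started


quiet = handle_requests(100_000)
stream = io.StringIO()
logged = handle_requests(100_000, stream)
print(f"without logging: {quiet:.4f}s, with logging: {logged:.4f}s")
```

Writing to an in-memory buffer, as this sketch does, is cheaper than writing to a console or a file, so the gap in a real deployment may be larger.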

The next option is an interesting one; on the surface, it does not seem like it should improve performance but rather degrade it. Our next option is max-worker-lifetime.

Max Worker Lifetime

The max-worker-lifetime option tells uWSGI to restart worker processes after the specified time (in seconds). Let’s go ahead and add the following to our uwsgi.ini file:

max-worker-lifetime = 30

This will tell uWSGI to restart worker processes every 30 seconds. Let’s see what effect this has after restarting our uWSGI processes and rerunning the ab command.

# ab -c 500 -n 5000 -s 90 http://example.com/9000
Requests per second:    1606.62 [#/sec] (mean)
Time per request:       311.212 [ms] (mean)
Time per request:       0.622 [ms] (mean, across all concurrent requests)

What is interesting is that one would expect uWSGI to lose some capacity while restarting worker processes. The result of the test, however, is that our throughput increases by roughly another 120 requests per second.

This works because this web application does not need to maintain anything in memory across multiple requests. This specific application actually works faster the newer the process is.

The reason for this is simple: A newer process has fewer memory management tasks to perform. Each HTTP request creates objects in memory for the web application, and eventually the application has to clean up these objects.

By restarting the processes periodically, we are able to forcefully create a clean instance for the next request.

When leveraging a middleware component such as uWSGI, this process can be very effective. This option can also be a bit of a double-edged sword; a value that is too low may cause more overhead from restarting processes than the benefit it brings. As with anything, it’s best to try multiple values and see which fits the application at hand.
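Since the options have been introduced one at a time, it may help to see them together. The sketch below collects every setting used so far and renders them as a single [uwsgi] section with Python’s configparser. Generating the file from Python is just a convenience for this illustration; editing uwsgi.ini by hand works equally well.

```python
import configparser
import io

# Every option below comes from the configuration steps in this article
settings = {
    "http": ":80",
    "chdir": "/root/fib",
    "wsgi-file": "app.py",
    "callable": "app",
    "processes": "4",
    "threads": "2",
    "enable-threads": "True",
    "disable-logging": "True",
    "max-worker-lifetime": "30",
}

config = configparser.ConfigParser()
config.optionxform = str  # keep option names exactly as written
config["uwsgi"] = settings

# Render to a string; writing to a real file is left to the reader
buffer = io.StringIO()
config.write(buffer)
print(buffer.getvalue())
```

Keeping the settings in one place like this also makes it easier to try several values for an option such as max-worker-lifetime and rerun the same ab test for each.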

Compiling Our Python Library to C

Now that we’ve tuned uWSGI, we can start looking at other options for greater performance, such as modifying the application itself and how it works.

If we look at the application above, all of the Fibonacci sequence generation is contained within the library fib. If we were able to speed up that library, we may see even more performance gains.

A somewhat simple way of speeding up that library is to convert the Python code to C code and tell our application to use the C library instead of a Python library. While this might sound like a hefty task, it is actually fairly simple using Cython.

Cython is a static compiler that is used for creating C extensions for Python. What this means is we can take our fib.py and convert it into a C extension.

Let’s go ahead and do just that.

Install Cython

Before we can use Cython, we are going to need to install it as well as another system package. The system package in question is the python-dev package. This package includes various libraries used during the compilation of Cython-generated C code.

To install this system package, we will once again use the Apt package manager.

# apt-get install python-dev

With the python-dev package installed, we can now install the Cython package using pip.

# pip install Cython

Once complete, we can start to convert our fib library to a C extension.

Converting our library

In order to facilitate the conversion, we will go ahead and create a setup.py file. Within this file, we’ll add the following Python code:

from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("fib.py"),
)

When executed, the above code will “Cythonize” the fib.py file, creating generated C code. Let’s go ahead and execute setup.py to get started.

# python setup.py build_ext --inplace

Once the above execution is completed, we should see a total of three files for the fib library.

$ ls -la
total 196
drwxr-xr-x 1 root root    272 Dec  5 21:52 .
drwxr-xr-x 1 root root    136 Dec  3 21:05 ..
-rw-r--r-- 1 root root    317 Dec  4 03:22 app.py
drwxr-xr-x 1 root root    102 Dec  3 21:03 build
-rw-r--r-- 1 root root 105135 Dec  5 21:52 fib.c
-rw-r--r-- 1 root root    281 Dec  3 21:03 fib.py
-rwxr-xr-x 1 root root  80844 Dec  5 21:52 fib.so
-rw-r--r-- 1 root root    115 Dec  5 21:51 setup.py

The fib.c file is the C source file that was created by Cython, and the fib.so file is the compiled version of this file that our application can import at run time.
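With both fib.py and fib.so sitting in the same directory, it is worth confirming which one Python actually imports. CPython’s default import machinery looks for extension modules before source files, so the compiled version should win, but a quick check removes any doubt. The helper name is_compiled below is made up for this sketch, and a stand-in object is used in place of the real fib module so the example runs anywhere.

```python
import importlib.machinery
import json
import types


def is_compiled(module):
    """Return True if `module` was loaded from a compiled extension file."""
    path = getattr(module, "__file__", None) or ""
    return path.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES))


# The json package is plain Python source, so this is not a compiled module
print(is_compiled(json))

# Stand-in for `import fib` after the build: a module object whose file
# name ends in this platform's extension-module suffix
fake_fib = types.SimpleNamespace(
    __file__="fib" + importlib.machinery.EXTENSION_SUFFIXES[-1]
)
print(is_compiled(fake_fib))
```

In the application directory, the equivalent check would be to import fib and pass it to is_compiled, or simply to print fib.__file__ and look at the extension.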

Let’s go ahead and restart our application and rerun our test again to see the results.

# ab -c 500 -n 5000 -s 90 http://example.com/9000
Requests per second:    1744.61 [#/sec] (mean)
Time per request:       286.598 [ms] (mean)
Time per request:       0.573 [ms] (mean, across all concurrent requests)

While the results do not show as much of an increase (about 138 requests per second), there is an increase in throughput nonetheless. As with most things, the results with Cython will vary from application to application.

Summary

In this article, with just a few tweaks to uWSGI and our application, we were not only able to increase performance, we were able to do so significantly.

When we started, our app was only able to accept 347 requests per second. After changing simple parameters, such as the number of worker processes and disabling logging mechanisms, we were able to push this application to 1744 requests per second.

The number of requests is not the only thing that increased. We were also able to reduce the time our application takes to respond to each request. If we go back to the beginning, the “mean” application request took 1.4 seconds to execute. After our changes, this same “mean” is 286 milliseconds. This means overall we were able to shave about 1.1 seconds off each request, a respectable difference.
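For a side-by-side view of what each change contributed, the short sketch below takes the Requests per second figure from each ab run in this article and computes the gain of every step over the one before it.

```python
# Requests per second reported by ab after each change in this article
results = [
    ("baseline", 347.28),
    ("processes and threads", 1068.63),
    ("disable-logging", 1483.35),
    ("max-worker-lifetime", 1606.62),
    ("Cython", 1744.61),
]


def step_gains(measurements):
    """Return (label, absolute gain, percent gain) for each successive step."""
    gains = []
    for (_, before), (label, after) in zip(measurements, measurements[1:]):
        delta = after - before
        gains.append((label, delta, delta / before * 100))
    return gains


for label, delta, percent in step_gains(results):
    print(f"{label}: +{delta:.2f} req/s ({percent:.1f}%)")

overall = results[-1][1] / results[0][1]
print(f"overall: {overall:.2f}x the baseline")
```

Seen this way, the bulk of the improvement came from adding processes and threads, with the later changes each contributing a smaller share.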

While this article covered some of the most common performance-tuning options within uWSGI, there are still quite a few that we haven’t touched. If you have a parameter that you feel we should have explored, feel free to drop it into the comments section.


Join the Discussion


  • ddorian43

    You’re saying that you need the threads to use multiple cpus ? Doesn’t make any sense. Why not just increase the number of processes ?

    Better, add gevent, change processes to number of cores, remove threads, set cpu-affinity=1.

    • madflojo

      @ddorian43:disqus By default with Linux (unless something specifies an affinity), child processes inherit the same CPU allocation as the parent.

      With uWSGI enabling threads will open the option of threads being assigned to either CPU.

      $ ps -eTo cmd,cpuid,pid,ppid | grep ^uwsgi
      uwsgi uwsgi.ini 0 3189 2829
      uwsgi uwsgi.ini 1 3189 2829
      uwsgi uwsgi.ini 1 3193 3189
      uwsgi uwsgi.ini 1 3193 3189
      uwsgi uwsgi.ini 1 3202 3189
      uwsgi uwsgi.ini 0 3202 3189
      uwsgi uwsgi.ini 1 3204 3189
      uwsgi uwsgi.ini 0 3204 3189

      It is not guaranteed, but without enabling threads, multiple processes do not necessarily get balanced across the CPUs.

      Also while setting CPU affinity to a specific CPU works in some cases, there are other situations where performance may be gained by distributing the process/cpu allocation. It all depends on the environment being tuned.

  • Using uWSGI threads can be bad for performance in real world applications because of Python’s GIL. I’ve sped up sites by disabling them.

    • madflojo

      One of the things I try to stress is that things like this are situational and you have to experiment to find what works for the environment being tuned. In some cases, threads can add some throughput, in others it can get in the way.

      • Sure but the article doesn’t read like that, it says “enable multiprocessing and enable threading” in the same breath before measuring the difference.

  • Daniil Yarancev

    Did you try to use PyPy?
    Can you please try it for this example?
    Just don’t use cython

  • Just using Cython on plain Python code doesn’t normally increase performance much – just as your demonstration showed. But if you annotate the types in the Python code using the special Cython syntax then it can increase performance of CPU-bound operations by one or two orders of magnitude.

  • P. Roebuck

    Maybe I’m missing something, but it appears you configured `uwsgi` as the system’s `httpd` replacement (i.e., access via port 80).
    So why are the `curl` commands accessing port 9000 getting responses?
