Table of contents

0. Prefacing remarks
1. Introduction
2. The `socket` library
3. The `select` library
4. Creating an event loop
4.1. A discourse on scheduling policies
4.1.1. Time-sharing scheduler
4.1.2. Related literature
4.1.3. Is it worth it?

Prefacing remarks

Before I get into this, I want to strongly emphasise that no guarantee is given as to the rigor, intelligence or thought behind the mathematical results in this blog. I wrote each section being led down rabbit holes by my intuition, and by this point, having written (4.1.3), I can conclusively say that a fair number of the sections were written while learning new things here and there, iteratively improving my approach.

My concrete implementations are available on my GitHub; this blog serves as a trail of how I would pick apart the creative process of designing a highly scalable HTTP server, without the utmost regard for correctness.

Introduction

This article serves as a sort-of log recalling each time I've had to rewrite my HTTPS server in Python, and the nuances that have come along with it. Of course, this is no authoritative approach to writing webservers, but I believe it's a tried-and-tested one. I appreciate your time, and hope to fulfill whatever reasons you have for reading this article.

https://www.usenix.org/legacy/events/hotos03/tech/full_papers/xu/xu.pdf

I feel obligated to reference this article because, although I'd pride myself on knowing a little about performant systems, I've always put threads and asynchronous architectures on more-or-less similar pedestals in terms of (a) scalability and (b) performance (in Python, at least). Still, I've refrained from writing threaded webservers, because I never wanted the unnecessary complexity when, in the end, the system is ultimately waiting on I/O resources completely agnostic of how many threads are waiting.

But, in the end, we're not designing systems that wait around; we're designing systems that perform at maximal throughput. We want to maximize resource responsiveness: as soon as a client creates a request, we want a thread/handler to wake up immediately and serve it, but we also want to work in a confined environment to prevent resource exhaustion. Simply put, we want resources that scale with usage metrics. Ideally, we want resources that can handle variable numbers of clients at a time and can load-balance, because we want to minimize thread/asynchronous context switches, and we want to minimize the use of synchronization primitives (in threaded systems, at least), because they're difficult to manage and we don't want to end up unexpectedly serializing our program.

Taking the article's conclusions (given its publishing date), threading comes more naturally for our concurrency model; but because we aren't working at a level where we control memory allocations, scheduling options or much of anything really, I opt for an asynchronous architecture, as a way of giving in to the fact that the global interpreter lock prevents threading from surpassing asynchronous architectures unless done un-Pythonically (though I'm not sure where I stand on this point).

Of course, for a webserver, we can't put everything onto the event loop, only what relates to the principal responsibility of serving and communicating with clients. It would be more ideal to thread disk I/O, interactive prompts, cache-refreshing mechanisms, etc.; otherwise we face the peril of bogging down the event loop.
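As a loose sketch of what I mean (the refresh_cache function, the file name and the pool size are all hypothetical, purely for illustration), blocking work could be pushed onto a small thread pool while the event loop only ever polls the resulting future:

from concurrent.futures import ThreadPoolExecutor

def refresh_cache(path: str) -> bytes:
	# hypothetical blocking task: re-read a cached file from disk
	with open(path, "rb") as f:
		return f.read()

# a small pool reserved for blocking work, kept off the event loop
blocking_pool = ThreadPoolExecutor(max_workers=4)
future = blocking_pool.submit(refresh_cache, "index.html")

# ... inside the event loop, between poll iterations:
if future.done():
	cached_page = future.result()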

The `socket` library

Obviously, since we aren't competent enough to write this in C, but neither uninspired enough to use asyncio, Flask or something else, we break even by building this on the socket library; which still puts us at the mercy of the programming gods, because subtle technicalities and bugs will creep in unless you know what you're doing.

Assuming we want to build a relatively scalable webserver, we cannot run serialized, i.e. we cannot block longer than necessary on any socket. Therefore, everything from the server socket down to each client socket, and anything they do, must be non-blocking, at the risk of delaying other connections.

import socket
import time

POLLING_DELAY = 1

sv_socket = socket.socket()
# defaults to AF_INET/SOCK_STREAM, i.e. IPv4/TCP

sv_socket.setblocking(False)
sv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, True)
sv_socket.bind((<host>, <port>))
sv_socket.listen(<backlog>)

while True:
	try:
		cl_sock, cl_addr = sv_socket.accept()
	except BlockingIOError:
		# nothing pending yet; poll again after a fixed delay
		time.sleep(POLLING_DELAY)
		continue
	...

In the above snippet, I show the semi-canonical form of a non-blocking server socket. A note on my variable naming convention: sv means server and cl means client. Generally this could be regarded as an antipattern, because it groups similar data by variable names as opposed to containers (i.e. classes/dataclasses); this will be addressed in future.

On the note of tuning, the <backlog> value is merely a hint to the underlying socket layer; whether it's 0 or the maximum OS-specified value doesn't affect memory usage in user-space. For an HTTP/1.x implementation, though, a non-zero backlog is preferable: the protocol is request-oriented and doesn't transfer multiple requests along a continuous stream, so the backlog lets requests pile up in the kernel while a blocking task occupies the server side.
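For reference, Python exposes the OS-specified maximum as socket.SOMAXCONN, so a hedged version of the listen call (reusing sv_socket from the snippet above, with 128 as an arbitrary cap chosen purely for illustration) might look like:

import socket

# the hint is clamped to the OS maximum anyway, but being explicit documents intent
backlog = min(128, socket.SOMAXCONN)
sv_socket.listen(backlog)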

The main grievance with this snippet, aside from everything running at the top level with no distinction between script and module level, is the time.sleep(...) call, which effectively serializes connections on a POLLING_DELAY increment, with the additional overhead of the (almost surely) executed try-except block. It's simply stinky code; and if the webserver were benchmarked with the given POLLING_DELAY of 1, it would surely fail given an insufficient backlog, whether because the web resource had more than <backlog> hyperlinks, or because <backlog> connections tried to fetch the page within that sleeping interval.

Our next step is to minimize the server-client connection code and delegate immediately to a handler. Note that, at this moment, we're one layer beneath handling the HTTP protocol: we have yet to establish the connection and handle packet fragmentation, TLS, graceful/non-graceful disconnections, and build the handling model.
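As a rough sketch of the shape I am after (handle_client is a hypothetical placeholder for the handling model we have not built yet, and sv_socket is reused from the snippet above):

def handle_client(cl_sock: socket.socket) -> None:
	# hypothetical handler; the real one is developed over the coming sections
	...

while True:
	try:
		cl_sock, cl_addr = sv_socket.accept()
	except BlockingIOError:
		continue   # still a busy poll; the next section replaces this waiting
	cl_sock.setblocking(False)
	handle_client(cl_sock)   # delegate immediately, keeping the accept path minimal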

The `select` library

When using non-blocking sockets, it would be much easier in every way if you could know what a socket is ready for, as opposed to poking it to see what falls out. Consider:

# ... setup server socket

clients = {}

while True:
	try:
		cl_sock, cl_addr = sv_socket.accept()
		cl_sock.setblocking(False)
		clients[cl_addr] = cl_sock
	except BlockingIOError:
		pass

	# iterate over a copy, since we delete disconnected clients mid-loop
	for cl_addr, cl_sock in list(clients.items()):
		try:
			# a zero-byte peek means the peer has closed the connection
			if not cl_sock.recv(1, socket.MSG_PEEK):
				raise BrokenPipeError
			data = cl_sock.recv(4096)
			# ... handle `data`
		except BrokenPipeError:
			del clients[cl_addr]
			cl_sock.close()
		except BlockingIOError:
			pass

The primary grievance I have with this structure is that it's inefficient: it has linear runtime with respect to the number of clients, regardless of how many are ready to be received from or written to. It doesn't scale well. In a sense, it is semantically similar to the select syscall, but with greater runtime overhead.
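For contrast, here is roughly what the same waiting looks like when expressed through the select syscall proper, via select.select (a minimal sketch reusing sv_socket and clients from the snippet above): the kernel hands back only the sockets that are actually readable, instead of us probing each one.

import select

# block for at most a second until some registered socket becomes readable
readable, _, _ = select.select([sv_socket, *clients.values()], [], [], 1.0)

for sock in readable:
	if sock is sv_socket:
		cl_sock, cl_addr = sv_socket.accept()
		cl_sock.setblocking(False)
		clients[cl_addr] = cl_sock
	else:
		data = sock.recv(4096)
		# ... handle `data`, dropping the client if it returned b""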

Thus, we can resolve this inefficiency with the epoll syscall, which is exposed neatly as select.epoll. To clarify a subtlety between epoll and poll: with epoll, you can ask the kernel to wake your poll object only once per socket event, unlike poll, which will persist the notification until you put the socket back into a waiting state, e.g. by emptying the receive buffer.

If the kernel repeatedly notifies a poll object, the notification is considered level-triggered, otherwise edge-triggered, which is nearly the entire difference between the epoll and poll syscalls. For our asynchronous dispatch architecture, we prefer edge-triggered polling, because we ideally want the handler itself to manage receiving and determine the payload's length; otherwise we cannot tell whether fragmented packets belong to one and the same higher-layer payload or to separate ones, which is a common issue one may encounter when writing socket servers.
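One practical consequence of edge-triggered notification is worth spelling out before the code: since the kernel reports the readiness transition only once, the receiving side has to drain the socket until it would block, otherwise leftover bytes generate no further wake-ups. A minimal sketch of that drain loop (the drain name and chunk size are my own choices):

import socket

def drain(cl_sock: socket.socket) -> bytes:
	# under EPOLLET, keep reading until the socket would block
	chunks = []
	while True:
		try:
			chunk = cl_sock.recv(4096)
		except BlockingIOError:
			break          # kernel buffer emptied; wait for the next edge
		if not chunk:
			break          # orderly shutdown from the peer
		chunks.append(chunk)
	return b"".join(chunks)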

from select import epoll, EPOLLET, EPOLLIN, EPOLLOUT, EPOLLHUP
from typing import Any
from queue import Queue

ClientDict = dict[int, dict[str, Any]]

# ... setup server socket

clients: ClientDict = {}

poller = epoll()
poller.register(sv_socket, EPOLLIN | EPOLLHUP | EPOLLET)

sv_sockfd = sv_socket.fileno()

while not poller.closed:
	for ev_sockfd, ev_mask in poller.poll():
		if ev_sockfd == sv_sockfd:
			if ev_mask & EPOLLHUP:
				poller.close()
				del clients
				break
			# ready to accept, ev_mask must be EPOLLIN
			# (under EPOLLET a production loop would accept until BlockingIOError)
			cl_sock, cl_addr = sv_socket.accept()
			cl_sock.setblocking(False)
			clients[cl_sock.fileno()] = {
				"socket": cl_sock,
				"address": cl_addr,
				"mqueue": Queue()
			}
			poller.register(cl_sock, EPOLLIN | EPOLLOUT | EPOLLHUP | EPOLLET)
			continue

		if ev_mask & EPOLLHUP:
			# ... GC calls `socket.__del__`, no other references left
			del clients[ev_sockfd]
			continue

		cl_socket = clients[ev_sockfd]['socket']
		cl_mqueue = clients[ev_sockfd]['mqueue']

		if ev_mask & EPOLLIN:
			# ... get hint from upper layer on how much to read,
			#     otherwise resort to just reading in the whole recv. buffer
			# ... still handle EOF error
			...

		if ev_mask & EPOLLOUT:
			# ... read from `mqueue` if non-empty and non-blocking, and send to the
			#     socket. we don't need to be informed by the upper layer because
			#     sending data is inherently our responsibility
			...

	# ... do other tasks while we are still actively serving clients

A few remarks:

  1. I store `sv_sockfd` knowing it is a trivial property; I simply find it neater
  2. `clients` stores each client as a 3-element dictionary. This persists the socket for its lifetime and avoids replicating the socket via `socket.socket`'s `fileno` kwarg, which would allocate an entirely new object. It can also be a very subtle bug to track down if you lose scope of a socket object and the other side abruptly disconnects
  3. The `mqueue` (message queue) should preferably be accessed via the `*_nowait` alternatives, with errors handled appropriately, in future prospect of a threading architecture; see the sketch after this list
  4. The `del clients[ev_sockfd]` might look a little suspicious; don't rely on the Python GC to do this for you. I write it this way for code brevity
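Regarding remark 3, a minimal sketch of what I mean by the `*_nowait` access pattern (the response bytes are a made-up example):

from queue import Queue, Empty, Full

mqueue: Queue = Queue(maxsize=1024)

# producer side, e.g. an upper layer queueing a response
try:
	mqueue.put_nowait(b"HTTP/1.1 200 OK\r\n\r\n")
except Full:
	pass   # back-pressure: drop, retry later, or close the connection

# consumer side, i.e. the EPOLLOUT branch of the loop
try:
	payload = mqueue.get_nowait()
except Empty:
	payload = None   # nothing to send on this iteration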

Now we've resolved our linear-time problem and structured the data a little more neatly, and we're ready to design the handler. How should we approach this? Should we encapsulate the connection- and data-handling logic in one event loop? Then we would let the server continue handling connections whilst serving clients; otherwise we leave it to the server to pre-empt handlers, which introduces a blocking event, but lets us build the architecture to be first-in first-out.

By co-operatively multitasking the connection handling logic with the data handling logic, we effectively place the bottleneck on the event loop’s efficiency, whereas in our former structure, the bottleneck would be on the data handler’s ability to quickly serve and yield control.

In the former solution, we avoid context-switch overhead, at the cost of having the n'th client in epoll's ready queue incur the accumulated latency of the prior (n - 1) data-handler calls. This would be an acceptable solution if our processing loop did nothing but process clients and our data-handling mechanism were lightweight, e.g. serving cached static pages; but as soon as we introduce something tangibly blocking, we end up serializing our program.

Since we're writing our own asynchronous implementation, we can define the threshold at which we consider our event loop to be blocked. For example, our basis for asynchronicity in the above program depends solely on epoll: no matter what, we cannot have data-handling coroutines run asynchronously to epoll, and there is no way to exceed the base performance of epoll, as that is an implementation detail of our kernel. Similarly, we define another basis for asynchronicity by the quality of our connection-handling code, because again, no matter what, any coroutines at higher layers inherently depend on its functionality; a higher-layer coroutine (e.g. a layer that waits for HTTP headers, statefully interprets the content and passes it on to a layer that handles the content itself, and so forth) always depends on the 'blockingness' of each sublayer.

Should we put the threshold at the layer that interprets the HTTP content? Is that the layer where we should enforce another event loop, because it might branch off into disk I/O coroutines, otherwise dragging down the underlying performance of the connection-handling code?

Therefore, this structure, amongst other reasons omitted, is a terrible idea. It ensures requests are served in a FIFO queue, but it couples the layers together far too much behaviourally, which leads to a constricted programming environment and tunnel-visions on quick, simple requests.

We want higher layers to be able to communicate with the connection handling layer, being able to yield to it whenever nothing particularly critical is occurring at that layer, or in other words being able to say that “It’s probably more important to serve other connections than whatever I’m doing right now, for a few iteration cycles at least.”
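To make that concrete, the shape I have in mind is a generator-based handler that yields whenever it is happy for the connection layer to run for a while; the yielded strings are made-up control hints, not an established API, and serve_request is a hypothetical name:

from typing import Generator

def serve_request(cl_sock) -> Generator[str, None, None]:
	# each `yield` hands control back to the connection-handling loop,
	# with a hint about why we paused
	yield "waiting-for-headers"
	# ... parse the headers once they have arrived
	yield "waiting-for-body"
	# ... build the response and queue it onto `mqueue`
	yield "done"

# the connection layer drives it one step per loop iteration:
# handler = serve_request(cl_sock)
# state = next(handler)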

Creating an event loop

A discourse on scheduling policies

An event loop is the overarching scheduler for every coroutine we want to run asynchronously. We have the most room for optimization in this architectural component, because it decides what runs and for how long. Our scheduler can either give full control to its coroutines until they decide it's time to yield control back (co-operative scheduling), or it can be a time-sharing scheduler (TSS) that assigns fixed or informed time slots to different tasks. The latter, if guided by the coroutines, can reduce to a co-operative scheduler, but it gives more power to the scheduler, letting it treat hints for what they are.

There are, naturally, trade-offs between these two scheduling policies. Ultimately, the difference is whether we assume the scheduler is smarter than the layers themselves at understanding how many resources they will need in the near future, or whether we collate hints from each layer to allocate appropriate time slots for each coroutine, depending on how many resources we think we can allocate across the board; or we can have a naive scheduler that distributes fixed time slots, which is much easier to implement.

Notice that, as alluded to before, a TSS can reduce to a co-operative scheduler, either if the time slots are too large to decompose small coroutines, or if the coroutines yield of their own accord to prevent blocking. So it is categorically more powerful than a co-operative scheduler, but only under certain behavioural conditions.
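A naive fixed-slot version of that idea, sketched over plain generators (the slice length and function names are arbitrary choices of mine): step a coroutine repeatedly until its slice is spent, then move to the next one. The coroutines still have to yield often enough for the slice to mean anything, which is exactly the reduction to co-operative scheduling described above.

import time
from collections import deque

SLICE = 0.005   # 5 ms per task, purely illustrative

def run(tasks: deque) -> None:
	while tasks:
		task = tasks.popleft()
		end = time.monotonic() + SLICE
		try:
			while time.monotonic() < end:
				next(task)           # one co-operative step of the coroutine
		except StopIteration:
			continue                 # task finished; drop it
		tasks.append(task)           # slice spent; requeue at the back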

Time-sharing scheduler

We cannot allocate time slots so small that the cost of a context switch doesn't break even, in some sense, with the quantity of resource utilisation, in other words the useful work done; because in the end, if the scheduler is doing more work than the actual coroutines it's scheduling, then we have an issue. If we decide to go down the path of allowing coroutines to provide metrics by which the scheduler can re-evaluate its own scheduling policy, then we need constraints on the scheduler itself.

A nicer way to visualise this constraint problem is that each coroutine lives in its own resource domain: it might be waiting for a slow client, uploading a large file, or interacting with databases or external components. The coroutine understands (more or less) the (i) time-scope, (ii) prioritization level and (iii) deadline of whatever it's doing. By understanding the time-scope, the coroutine can inform the scheduler of how much time it will approximately need to finish; if the coroutine believes it should be held in a separate priority group, perhaps because the user needs quicker resource availability, it can inform the scheduler. Finally, if the coroutine doesn't necessarily know the time-scope, but knows the maximum time it would allocate itself before declaring a timeout, it can let the scheduler know, in order to advise the scheduler on what the worst-case scenario would look like. In this final case of deadlines, the scheduler may opt to override prioritization groups simply to allow a task to finish, so that a coroutine doesn't starve in the worst case.
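To pin that down, the hints a coroutine might hand the scheduler could be as simple as the following structure (the field names are mine and purely illustrative):

from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulingHint:
	time_scope: Optional[float] = None   # (i) estimated seconds of work remaining
	priority: int = 0                    # (ii) requested priority group; higher is more urgent
	deadline: Optional[float] = None     # (iii) timeout, in seconds, from first invocation

# a coroutine could then yield such a hint instead of a bare value, e.g.
# yield SchedulingHint(time_scope=0.02, priority=1, deadline=5.0)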

Related literature

During my research on efficient scheduling algorithms for webservers, and scheduling theory in general, I came upon a couple of interesting articles; but primarily, and possibly most related to my exposition in the prior subsection:

https://www.researchgate.net/publication/220724348_Implementation_of_a_new_Scheduling_Policy_in_Web_Servers

Essentially, it is a scheduling policy based on the shortest remaining response time (SRRT) of tasks, wherein the data handlers can feed request metrics like round-trip time (RTT) and the TCP congestion window (cwnd) size back into the scheduler, providing a heuristic that allows scheduling based on the fastest connection, and on whichever link has the capacity to receive higher loads in less time given its congestion window's size.

This is similar to the idea of the time-sharing scheduler that I describe above, since it is essentially a prescription for how to create priority groups, advise the scheduler of the time-scope, and determine deadlines.

Is it worth it?

In the end, the discussion of these more-or-less complicated scheduling algorithms is moot if their implementation is too costly, or simply infeasible in a language such as Python. So, is it infeasible? My first consideration is that the pipeline a request passes through might create a very tangible overhead for scheduling: it must be received by the server, placed into the handler's receive queue alongside the heuristic information that informs the scheduler's next decision, and then it must wait for its next time slot to be invoked in order to be handled, perhaps to completion.

I theorise that, naturally, at lower orders of throughput, Python will perform worse than simple round-robin or fully co-operative scheduling policies, simply due to the bookkeeping that managing heuristics like this requires; though as a solution to this problem, one could have a piecewise scheduling architecture for different levels of throughput. But this seems unnecessary, since the performance difference between schedulers at low throughput is likely unnoticeable to the end user.

An important question, with regard to our proposed TSS (informed by SRRT), is how we should unify these heuristics into a single value the scheduler can base its decisions on. We want to minimize RTT but maximize cwnd size, because we want small latency with large link capacity. Our scheduling must also be deadline-based, as we are lending both user-dominant processes (serving requests) and auxiliary processes (e.g. recaching files; any long-running processes not serving an immediate user) to the scheduler, and we want to be able to prioritize different tasks based on, essentially, how much we care about them right now. As an aside, periodic tasks should be specialized and differentiated by the scheduler, so that beyond a certain threshold difference between deadline and now the task is given no runtime, but below that threshold the regular deadline scheduling policy applies.

We begin by saying that we have a time-share of 1, of which we fraction out shares $0 \leq p_i \leq 1$ to the $i$'th task based on its deadline parameter $d_i$, RTT $R_i$ and cwnd $W_i$. Trivially, $\sum_i p_i = 1$, and we define $d_i$ as a time period denoting the timeout from the initial invocation of that task (more precisely, $d_i(t)$). In a more-or-less configurable expression, we can have:

\[p_i = \alpha R_i + \beta W_i + \gamma d_i(t)\]

Satisfying $0 \leq \alpha, \beta, \gamma \leq 1$ such that $p_i \leq 1$. We want our deadline parameter to overarch the rest of the parameters, because we want a real-time system, therefore we let $\alpha, \beta \propto \gamma^{-1}$. Then $\alpha$ and $\beta$ are effectively normalizing constants. Let us suppose that $R_i$ is a function of time, spanning the duration of the TCP stream, and that $R_i(t)$ is the average RTT up to time $t$, i.e. more or less:

\[R_i(t + \delta) = \frac{r_i(t + \delta) + \#_i(t)R_i(t)}{\#_i(t + \delta)}\]

Where $r_i(t)$ is the pointwise RTT at time $t$, and $\#_i(t)$ is the number of packets at time $t$.
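To close, a small numerical sketch of this bookkeeping (the weights are arbitrary, and the computed shares would still need normalizing so that $\sum_i p_i = 1$):

ALPHA, BETA, GAMMA = 0.2, 0.2, 0.6   # illustrative weights only

def update_avg_rtt(avg_rtt: float, n_packets: int, sample_rtt: float) -> tuple[float, int]:
	# running average: R_i(t + d) = (r_i(t + d) + #_i(t) * R_i(t)) / #_i(t + d)
	return (sample_rtt + n_packets * avg_rtt) / (n_packets + 1), n_packets + 1

def raw_share(avg_rtt: float, cwnd: float, time_to_deadline: float) -> float:
	# p_i = alpha * R_i + beta * W_i + gamma * d_i(t), before normalization
	return ALPHA * avg_rtt + BETA * cwnd + GAMMA * time_to_deadline

# normalizing across all tasks so the shares sum to 1:
# raw = [raw_share(r, w, d) for (r, w, d) in task_metrics]
# p = [x / sum(raw) for x in raw]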