Monday, December 14, 2020

Old Dog, New Tricks

In their article "File Descriptor Transfer over Unix Domain Sockets" in their blog CopyConstruct, distributed systems engineer Cindy Sridharan references an article by Facebook engineers and others that appeared in the proceedings of a virtual ACM conference: "Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website" [Usama Naseer et al., SIGCOMM '20, 2020-08-10]. It's a real eye opener. I thought I was pretty conversant with the socket interface available in Linux, having been doing that kind of interprocess/interprocessor communication since my 4.2 BSD days on a VAX-11/750. Apparently I have some catching up to do.

Sridharan's article points out that you can transfer open sockets from one process to another, running on the same computer, via a UNIX domain socket, in the address family AF_UNIX (a.k.a. AF_LOCAL). UNIX domain sockets (or what I typically refer to as "local sockets") only work between endpoints on the same computer. Instead of using an IP address and port number, the rendezvous between endpoints is identified using a unique name in the file system namespace. Local sockets have been around for eons, but I was not aware of this use for them.
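To make the file system rendezvous concrete, here is a minimal sketch of a local socket listener and its matching connector. The function names local_address(), local_listen(), and local_connect() are hypothetical helpers made up for this illustration; they are not part of Diminuto. It uses an ordinary SOCK_STREAM local socket and assumes a Linux-like system.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill in a local socket address from a file system path (hypothetical helper). */
static int local_address(struct sockaddr_un *ap, const char *path)
{
    if (strlen(path) >= sizeof(ap->sun_path)) { return -1; }
    memset(ap, 0, sizeof(*ap));
    ap->sun_family = AF_UNIX;
    strcpy(ap->sun_path, path);
    return 0;
}

/* Create a listening local stream socket bound to the given path. */
int local_listen(const char *path)
{
    struct sockaddr_un address;
    int fd;

    if (local_address(&address, path) < 0) { return -1; }
    if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { return -1; }
    (void)unlink(path); /* Remove any stale name left behind by a prior run. */
    if ((bind(fd, (struct sockaddr *)&address, sizeof(address)) < 0)
        || (listen(fd, 8) < 0)) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Connect to the local socket that is listening at the given path. */
int local_connect(const char *path)
{
    struct sockaddr_un address;
    int fd;

    if (local_address(&address, path) < 0) { return -1; }
    if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { return -1; }
    if (connect(fd, (struct sockaddr *)&address, sizeof(address)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The path plays the role that the IP address and port number play for an Internet socket: the listener binds to it, and the connector names it. The name persists in the file system after the socket is closed, which is why the sketch unlinks any stale name before binding.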

What this is not is a child process inheriting a file descriptor from a parent process. This is sending a message over a local socket from one process to another that effectively results in a dup(2) system call on the receiving end for one or more file descriptors from the sending end that are specified in the message. The system calls involved are sendmsg(2) and recvmsg(2).

ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);

ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);

Even the socket type was a new one on me: SOCK_SEQPACKET, which exchanges a message having fixed boundaries like a datagram, but which has guaranteed delivery like a TCP socket.

int listensocket = socket(AF_UNIX, SOCK_SEQPACKET, 0);
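A small sketch can illustrate the message boundary behavior. It is not taken from the test program: seqpacket_boundaries() is a hypothetical function that uses socketpair(2) to get two already-connected local SOCK_SEQPACKET sockets, which assumes a system (such as Linux) that supports that socket type in the UNIX domain.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send two messages back to back over a connected pair of local
 * sequential packet sockets, then receive them using an oversized
 * buffer. The length of each receive is stored in lengths[].
 * Returns 0 on success, -1 on failure.
 */
int seqpacket_boundaries(ssize_t lengths[2])
{
    int pair[2];
    char buffer[64];
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, pair) < 0) { return -1; }

    if ((send(pair[0], "abc", 3, 0) == 3)
        && (send(pair[0], "defgh", 5, 0) == 5)) {
        /* Each recv returns exactly one message, never a merge of both. */
        lengths[0] = recv(pair[1], buffer, sizeof(buffer), 0);
        lengths[1] = recv(pair[1], buffer, sizeof(buffer), 0);
        rc = 0;
    }

    close(pair[0]);
    close(pair[1]);
    return rc;
}
```

With a SOCK_STREAM socket the first receive could legitimately return all eight bytes at once. With SOCK_SEQPACKET the two receives return three and five bytes respectively, even though the buffer has room for both messages.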

I had used sendmsg and recvmsg with struct msghdr before, to send and receive vector I/O using datagrams. You can find examples of this in my test program unittest-ipc-scattergather.c that I described in a previous article "Scatter/Gather".
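For readers who have not seen vector I/O with struct msghdr, the following is a rough sketch of the idea, not the code from unittest-ipc-scattergather.c. The names gather_send() and scatter_receive() are hypothetical, and the sketch assumes a datagram (or sequential packet) socket so that one sendmsg corresponds to one recvmsg.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send two separate buffers as a single message (gather). */
ssize_t gather_send(int fd, const void *a, size_t alen, const void *b, size_t blen)
{
    struct iovec vector[2];
    struct msghdr message;

    vector[0].iov_base = (void *)a;
    vector[0].iov_len = alen;
    vector[1].iov_base = (void *)b;
    vector[1].iov_len = blen;

    memset(&message, 0, sizeof(message));
    message.msg_iov = vector;
    message.msg_iovlen = 2;

    return sendmsg(fd, &message, 0);
}

/* Receive a single message into two separate buffers (scatter). */
ssize_t scatter_receive(int fd, void *a, size_t alen, void *b, size_t blen)
{
    struct iovec vector[2];
    struct msghdr message;

    vector[0].iov_base = a;
    vector[0].iov_len = alen;
    vector[1].iov_base = b;
    vector[1].iov_len = blen;

    memset(&message, 0, sizeof(message));
    message.msg_iov = vector;
    message.msg_iovlen = 2;

    /* The first buffer is filled completely before the second is used. */
    return recvmsg(fd, &message, 0);
}
```

Both functions return the total number of payload bytes transferred across all of the buffers in the vector. This is handy for keeping a fixed-size header and a variable-size payload in separate variables without copying them into one contiguous buffer.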

But struct msghdr has additional fields that can be used to transmit additional ancillary information (the term used in the manual page). This includes control messages like SCM_RIGHTS, which enables the transfer of open file descriptors across process boundaries. (You still have to transmit at least one byte of conventional data, even if it is ignored by the far end; the control message tags along.)

You can find the test program described here at unittest-ipc-ancillary.c. In that program, there is one "main" process, one "workload" process, and four "instance" processes. 

The workload process manages a pool of sixty-four client threads. Each active client thread tries to connect to the same listen socket, identified by an IP address and port number. When the connection is established, the client writes one or more requests, and for each request reads a response from the server on the far end. Then the client closes the connected socket, returns itself to the thread pool, and ends. The workload process continuously waits for a client thread to appear in the pool; it removes it from the pool and starts it.

Each instance process uses sendmsg to send a message to the main process over a local socket asking for the listen socket. The listen socket is a conventional socket identified by an IP address and port number, and has pending client threads waiting for their connection to be accepted. The main process receives the request using recvmsg and replies using sendmsg, passing its open listen socket to one instance process at a time. It then waits for that process to exit and collects its status. Only then is the connection request of the next instance process accepted, and that process is sent the same open listen socket.

Each instance process manages a dispatcher thread. The dispatcher thread continuously accepts connection requests on the listen socket from client threads, and assigns each new stream socket to one of eight server threads that it has started from a thread pool. When the client thread closes its end of the socket, the server thread returns itself to the thread pool and ends. During a single activation, a server thread services a sequence of requests from a single client.

The main process that drives the test gives each instance process ten seconds to service as many clients as it can before the main process signals it to exit.

Here is a code snippet lifted from that test program for the sending side. diminuto_ipcl_packet_send() is the function in my Diminuto library that calls sendmsg.

struct iovec vector[1];
struct msghdr message;
union { struct cmsghdr alignment; char data[CMSG_SPACE(sizeof(int))]; } control;
struct cmsghdr * cp;
char dummy[1] = { '\0' };

vector[0].iov_base = dummy;
vector[0].iov_len = sizeof(dummy);

/* Zero both structures so that unused fields and padding are well defined. */
memset(&message, 0, sizeof(message));
memset(&control, 0, sizeof(control));

message.msg_iov = vector;
message.msg_iovlen = countof(vector);

message.msg_control = &control;
message.msg_controllen = sizeof(control);

/* Build a single control message carrying the listen socket descriptor. */
cp = CMSG_FIRSTHDR(&message);
cp->cmsg_level = SOL_SOCKET;
cp->cmsg_type = SCM_RIGHTS;
cp->cmsg_len = CMSG_LEN(sizeof(listensocket));
memcpy(CMSG_DATA(cp), &listensocket, sizeof(listensocket));

ASSERT(diminuto_ipcl_packet_send(activationsocket, &message) == sizeof(dummy));

Here is a code snippet from the receiving side. Similarly, diminuto_ipcl_packet_receive() calls recvmsg. Note that the value returned by both sendmsg and recvmsg is the length of the dummy payload, not the length of struct msghdr or struct cmsghdr.

char dummy[1];
struct iovec vector[1];
struct msghdr message;
union { struct cmsghdr alignment; char data[CMSG_SPACE(sizeof(int))]; } control;
struct cmsghdr * cp;
ssize_t length;
int listensocket = -1;

vector[0].iov_base = dummy;
vector[0].iov_len = sizeof(dummy);

/* Zero the header so that unused fields are well defined. */
memset(&message, 0, sizeof(message));

message.msg_iov = vector;
message.msg_iovlen = countof(vector);

message.msg_control = &control;
message.msg_controllen = sizeof(control);

ASSERT((length = diminuto_ipcl_packet_receive(activationsocket, &message)) == sizeof(dummy));

/* Walk the control messages looking for exactly one passed descriptor. */
for (cp = CMSG_FIRSTHDR(&message);
     cp != (struct cmsghdr *)0;
     cp = CMSG_NXTHDR(&message, cp)) {
  if (cp->cmsg_level != SOL_SOCKET) { continue; }
  if (cp->cmsg_type != SCM_RIGHTS) { continue; }
  if (cp->cmsg_len != CMSG_LEN(sizeof(listensocket))) { continue; }
  memcpy(&listensocket, CMSG_DATA(cp), sizeof(listensocket));
  break;
}

The test program is constructed such that the listen socket being passed to each instance process doesn't even exist when the instance processes are forked. The only way the listen socket could be known to the instance processes is via the control message mechanism.
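The same idea can be shown in a much smaller, self-contained sketch that calls sendmsg and recvmsg directly instead of going through Diminuto. Everything here is an assumption made for illustration: fd_send(), fd_receive(), and pass_after_fork() are hypothetical names, a pipe stands in for the listen socket, and error handling is minimal (a failure in the parent can leave the child blocked). The parent forks first and creates the pipe only afterwards, so the child cannot have inherited it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send one open file descriptor over a local socket; returns 0 on success. */
int fd_send(int sock, int fd)
{
    char dummy = '\0';
    struct iovec vector = { .iov_base = &dummy, .iov_len = sizeof(dummy) };
    union { struct cmsghdr alignment; char data[CMSG_SPACE(sizeof(int))]; } control;
    struct msghdr message;
    struct cmsghdr *cp;

    memset(&control, 0, sizeof(control));
    memset(&message, 0, sizeof(message));
    message.msg_iov = &vector;
    message.msg_iovlen = 1;
    message.msg_control = &control;
    message.msg_controllen = sizeof(control);

    cp = CMSG_FIRSTHDR(&message);
    cp->cmsg_level = SOL_SOCKET;
    cp->cmsg_type = SCM_RIGHTS;
    cp->cmsg_len = CMSG_LEN(sizeof(fd));
    memcpy(CMSG_DATA(cp), &fd, sizeof(fd));

    return (sendmsg(sock, &message, 0) == (ssize_t)sizeof(dummy)) ? 0 : -1;
}

/* Receive one file descriptor; returns it, or -1 if none arrived. */
int fd_receive(int sock)
{
    char dummy;
    struct iovec vector = { .iov_base = &dummy, .iov_len = sizeof(dummy) };
    union { struct cmsghdr alignment; char data[CMSG_SPACE(sizeof(int))]; } control;
    struct msghdr message;
    struct cmsghdr *cp;
    int fd = -1;

    memset(&message, 0, sizeof(message));
    message.msg_iov = &vector;
    message.msg_iovlen = 1;
    message.msg_control = &control;
    message.msg_controllen = sizeof(control);

    if (recvmsg(sock, &message, 0) != (ssize_t)sizeof(dummy)) { return -1; }

    for (cp = CMSG_FIRSTHDR(&message); cp != NULL; cp = CMSG_NXTHDR(&message, cp)) {
        if (cp->cmsg_level != SOL_SOCKET) { continue; }
        if (cp->cmsg_type != SCM_RIGHTS) { continue; }
        if (cp->cmsg_len != CMSG_LEN(sizeof(fd))) { continue; }
        memcpy(&fd, CMSG_DATA(cp), sizeof(fd));
        break;
    }
    return fd;
}

/* Fork first, create a pipe afterwards, then pass its read end to the child. */
int pass_after_fork(void)
{
    int pair[2];
    int pipefd[2];
    int status = -1;
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, pair) < 0) { return -1; }
    if ((pid = fork()) < 0) { return -1; }

    if (pid == 0) {
        /* Child: the pipe did not exist at fork time; it arrives via SCM_RIGHTS. */
        char byte = '\0';
        int fd;
        close(pair[0]);
        fd = fd_receive(pair[1]);
        if ((fd < 0) || (read(fd, &byte, 1) != 1)) { _exit(1); }
        _exit((byte == 'K') ? 0 : 2);
    }

    /* Parent: create the pipe only now, preload one byte, send the read end. */
    close(pair[1]);
    if (pipe(pipefd) < 0) { return -1; }
    if (write(pipefd[1], "K", 1) != 1) { return -1; }
    if (fd_send(pair[0], pipefd[0]) < 0) { return -1; }
    if (waitpid(pid, &status, 0) != pid) { return -1; }
    close(pipefd[0]);
    close(pipefd[1]);
    close(pair[0]);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

pass_after_fork() returns zero when the child was able to read the preloaded byte through the descriptor it received. Note that the descriptor number the child gets back need not match the number in the parent; what is transferred is a reference to the same open file, much as dup(2) would produce.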

This is a remarkable capability. I owe Cindy Sridharan a debt of gratitude for bringing it to my attention.

Addendum (2020-12-15)

Fazal Majid was kind enough to pass along the fact that this capability has been around for a long time, and cited a classic reference that I have on my bookshelf just a couple of feet away.

W. Richard Stevens, Stephen A. Rago, Advanced Programming in the UNIX Environment, 2nd ed., Addison-Wesley, 2005

It took me a few minutes to find it: 17.4.2, "Passing File Descriptors over UNIX Domain Sockets", pp. 606-614. I'm embarrassed to admit I missed this. In my defense, the book is 927 pages long. A big thank you to Mr. Majid for pointing this out.

Addendum (2020-12-16)

The SOCK_SEQPACKET socket type looks pretty interesting, doesn't it? It has the reliability of SOCK_STREAM with the fixed message boundaries of SOCK_DGRAM. Why don't we see it used more? Because when it is used over IP, underneath the hood it uses neither the Transmission Control Protocol (TCP) used for streams, nor the User Datagram Protocol (UDP) used for datagrams, the two transport-layer protocols on which most of the Internet is based. Instead, it uses the Stream Control Transmission Protocol (SCTP), originally developed to carry Signaling System 7 (SS7), the telecommunications protocol stack used to set up and tear down telephone calls in the Public Switched Telephone Network (PSTN), over IP networks. SCTP is defined in RFC 4960. SCTP isn't as widely deployed in the Internet at large as its peers in the transport layer. I'm not confident SCTP packets would make it through all firewalls. And it's not that hard to parse out fixed size messages from TCP streams. Diminuto provides an API to create a sequential packet socket only for the UNIX domain (local) address family, where no network transport protocol is involved and where it might be useful for inter-process communication on the same computer.


Fazal Majid said...

I learned this trick in school 30 years ago, never used it. It’s documented in Stevens’ Advanced Programming in the UNIX environment (chapter 17.4 in my ebook copy of the third edition).

Chip Overclock said...

I have that very volume sitting on my bookshelf in my home office just a couple of feet from where I'm typing this. But if I ever read that section, I sure don't remember it.