The friendship between Haskell and C

As a brief introduction to how Haskell FFI works, I’ll be talking about my memfd package, which is available on Hackage. Part 1 of this article lays out concepts; part 2 will explain the package.

Mar 16, 2023

If ever there were two programming languages said to be at odds with one another, it might be Haskell and C. But this is not so true as it seems; they can play quite nicely with one another. Haskell’s foreign function interface lets us write Haskell code that uses libraries of other languages — notably, C.

As a brief introduction to how Haskell FFI works, I’ll be talking about my memfd package, which is available on Hackage. This article is part 1 of 2, laying out concepts and motivation. Part 2 will go over the memfd package, the Linux API that it uses, and how their friendly relationship is expressed in Haskell.


File descriptors in the abstract

The word ‘file’ arises because files are often used for persistent storage like papers in a cabinet. When a process opens a file, it receives a file descriptor (FD). Conceptually, this is a reference to the file; literally, it is an integer. This integer — only within the context of the process that it belongs to, and only from the time the FD is opened until the time it is closed — refers to a particular file. The process uses the FD to read from or write to the file.

But we must get past storage cabinets and learn to think of a file more abstractly as an inter-process communication channel. An FD is an integer that, within the context of a process, abstractly identifies some resource, and there are many kinds of resources. Some examples of file descriptors:

The standard input, output, and error streams (stdin, stdout, and stderr) that every process implicitly starts with as FDs 0, 1, and 2. Depending on how the process was initialized, each of these streams might represent a persistent file, or it might represent another process.
When a network client opens a connection to a server, the server process receives an FD representing the socket that it uses to communicate with the client.
When a Wayland graphical application starts, it first connects to a UNIX-domain socket to initiate contact with the graphics server, receiving a socket FD it will use to send messages such as “a new frame is ready to display.”
The next thing a Wayland client does is open a file (creating another FD) into which it will write the image data that it wishes to display on screen.
It then sends that FD over the socket to the Wayland server. This results in the creation of yet another FD within the Wayland server process. Both FDs refer to the same file, thus establishing another form of inter-process communication (one that is faster than the socket for transmitting large amounts of graphical data).

Sockets make the terminology awkward, because a socket is not a file, but the integer that we use to identify it is called a file descriptor anyway. A socket has a file descriptor that doesn’t correspond to a file. This state of affairs is said to be quite elegant, and perhaps it is, though the nomenclature is just painful.

My last three examples of file descriptors all pertained to Wayland because that was my motivation for writing the memfd package, the discussion of which is forthcoming.

File descriptors in Haskell

In the System.Posix.Types module of the base package, we find the following definition:

newtype Fd = Fd CInt

A file descriptor is, truly, only a number. This is a newtype for CInt, which is called a “C Int” because it corresponds to the “int” type in C. This is how we can start to see that Haskell and C are friends; Haskell’s standard library has definitions like this to let us talk about C using C’s own terms.
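To make the "only a number" point concrete, here is a minimal sketch that uses nothing beyond base. The names standardFds and fdToCInt are made up for this illustration; only Fd and CInt come from the library.

```haskell
import Foreign.C.Types (CInt)
import System.Posix.Types (Fd (..))

-- The three descriptors that every process starts with.
standardFds :: [(String, Fd)]
standardFds = [("stdin", Fd 0), ("stdout", Fd 1), ("stderr", Fd 2)]

-- Hypothetical helper: unwrapping the newtype gives back the plain C int.
fdToCInt :: Fd -> CInt
fdToCInt (Fd n) = n

main :: IO ()
main = mapM_ describe standardFds
  where
    describe (name, fd) = putStrLn (name ++ " is FD " ++ show (fdToCInt fd))
```

Nothing here touches the operating system; the point is only that constructing and unwrapping an Fd is ordinary newtype manipulation.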

Side note: A Handle is related to but not quite the same as an FD. The Handle type is what you will generally be using to write cross-platform code; it is in some ways more abstract. The Fd type is what we use for Unix-specific work. If you need to turn an Fd into a Handle, you can use fdToHandle (found in System.Posix.IO in the unix package).
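As a rough sketch of that conversion, assuming the fdToHandle exported by System.Posix.IO in the unix package: the function name sayViaHandle is invented for this example, and duplicating the descriptor first is my own precaution so that closing the new Handle does not close the process's real stdout.

```haskell
import System.IO (hClose, hPutStrLn)
import System.Posix.IO (dup, fdToHandle, stdOutput)

-- Hypothetical helper: write a line to stdout through a Handle built from an Fd.
sayViaHandle :: String -> IO ()
sayViaHandle message = do
    fd <- dup stdOutput   -- a second FD referring to the same stream as FD 1
    h <- fdToHandle fd    -- the Handle now owns the duplicated FD
    hPutStrLn h message
    hClose h              -- flushes the buffer and closes only the duplicate

main :: IO ()
main = sayViaHandle "Hello from a Handle that started life as an Fd"
```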

File systems

The fact that there is more than one type of file system rose to the attention of computer laypersons with the proliferation of floppy disks. When you bought a pack in the store, its label indicated whether it was formatted for PC or Mac (nobody cares about the Linux users). The magnetic disk didn’t come from the store in a truly blank state; it was set up with the initial structure that the computer needs to see that the disk is blank. Since Microsoft and Apple chose different file systems, slightly different disks were produced for each market segment. Linux users thought this was silly: why didn’t normal people just format the disk themselves when they got it home like we did, why are they outside having fun while we format our disks, and why did they not invite us?

The distinctions between PC/Mac file systems are somewhat trivial, merely differing in implementation details. The overall purpose of them is the same: to arrange data on the disk. If this is all a file system means to you, it’s time to broaden your mind to encompass other kinds:

tmpfs looks like a normal disk file system, but it’s a mirage; this system is backed by volatile storage (RAM) rather than persistent storage. If you need to write a file temporarily but don’t need the file to persist indefinitely, tmpfs is appropriate because it’s faster. If you need the file to not persist indefinitely, tmpfs is appropriate because if you forget to clean up your garbage, it will always get cleaned up automatically the next time the system restarts.
sshfs also looks like a normal file system, but it’s backed by another file system on another computer. If you can use SSH to access a remote computer, you can use sshfs to map the remote computer’s files into your own system’s directory tree to pretend like their stuff is yours. It’s pretty neat.
procfs isn’t a general-purpose storage system at all, but rather a means of reading information about your computer. It appears in most systems as the /proc directory, which contains mostly a bunch of text files. For example, /proc/meminfo shows how much RAM you have, and /proc/<process id>/environ shows all the environment variables for a running process. I recommend poking around in there some time, because you can find a fascinating amount of stuff.
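Because procfs entries behave like text files, ordinary file I/O is enough to read them. The sketch below assumes a Linux system with /proc mounted; memInfoHead is a name made up for this example.

```haskell
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- Hypothetical helper: return the first n lines of /proc/meminfo.
memInfoHead :: Int -> IO [String]
memInfoHead n = withFile "/proc/meminfo" ReadMode $ \h -> do
    contents <- hGetContents h
    let firstLines = take n (lines contents)
    -- hGetContents is lazy, so force the lines before withFile closes the Handle.
    length (unlines firstLines) `seq` pure firstLines

main :: IO ()
main = memInfoHead 3 >>= mapM_ putStrLn
```

The exact lines printed depend on the kernel, so no particular output is promised here.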

Anonymity and garbage

Everything I have thus far described as a file has a file path. When we create a file, we create it within a directory and with a name. The directory and name together constitute an address by which we can open the file later. The association of a file, a name, and a directory is called a hard link.

A somewhat less-considered notion is that a file can have more than one hard link. You can use the ln command-line utility to give a file additional hard links. Doing so does not copy the file, nor does it create a situation wherein one of the paths is the real one and the other merely a pointer (such a pointer is called a symbolic link). The file is simply linked into the file system in more than one place.
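The same thing that ln does can be sketched from Haskell. This assumes the unix package (System.Posix.Files) alongside directory and filepath; the function name linkCounts and the two file names are invented for the demonstration, and it assumes the temporary directory lives on a file system that supports hard links.

```haskell
import System.Directory (getTemporaryDirectory)
import System.FilePath ((</>))
import System.Posix.Files (createLink, getFileStatus, linkCount, removeLink)
import System.Posix.Types (LinkCount)

-- Hypothetical helper: report a file's hard link count before and after
-- giving it a second name.
linkCounts :: IO (LinkCount, LinkCount)
linkCounts = do
    dir <- getTemporaryDirectory
    let original = dir </> "hard-link-demo-a.txt"
        second = dir </> "hard-link-demo-b.txt"
    writeFile original "one file, two names\n"
    before <- linkCount <$> getFileStatus original
    createLink original second    -- no data is copied; the file gains a name
    after <- linkCount <$> getFileStatus original
    removeLink original           -- clean up both names
    removeLink second
    pure (before, after)

main :: IO ()
main = linkCounts >>= print
```

On a fresh run the count should go from 1 to 2. If a previous run was interrupted and left the second name behind, createLink will fail rather than overwrite it.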

One way to delete a file is to use unlink. (The rm utility is more commonly taught because it has more features, including the ability to delete directories.) But this doesn’t necessarily result in the destruction of the file; it only removes a hard link. If there are other hard links, then the file still exists. Only once a file’s hard link count reaches 0 is the file really gone.

No, I lied: a file can exist without any hard links at all. I’ll give a quick demonstration. The following requires base, directory, and filepath.

import Prelude
import System.Directory
import System.FilePath
import System.IO

main :: IO ()
main = do
    dir <- getTemporaryDirectory      -- 1
    let file = dir </> "demo.txt"

    h <- openFile file ReadWriteMode  -- 2
    removeFile file

    start <- hGetPosn h               -- 3
    hPutStrLn h "Hello!"

    hSetPosn start                    -- 4
    hGetContents h >>= putStrLn

1. Look up the system’s default temporary file location and construct a file path.
2. Create a file, and then immediately unlink it.
3. Mark the position at the start of the file, then write a message.
4. Reset the file handle to the start to read back the message and print it.

Thus we see that one can go on using a file even after it has been completely removed from the file system. So the real condition under which the operating system can garbage-collect a file is when it has no remaining hard links and no open file descriptors.

Why this is interesting

The first reason to understand this is to appreciate the importance of not writing software with resource leaks. If you have a long-running process that forgets to close a Handle or two, you might think: How big a deal could that possibly be?

"It's one banana, Michael. What could it cost? $10?"

If you were expecting that file to get deleted, what it could cost is however much space that file takes up. As long as you have an open Handle, the operating system has to keep the entire content of that file until the Handle is closed or your process ends.

That is a good reason to use ResourceT in any situation where you’re dealing with a file. (This is the focus of chapter 1 of Sockets and Pipes.)
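For a sense of what that looks like, here is a small sketch assuming the resourcet package (Control.Monad.Trans.Resource) plus directory and filepath. The function writeGreeting and the file name are made up for this example; it is one plausible shape, not the only way to use ResourceT.

```haskell
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (allocate, runResourceT)
import System.Directory (getTemporaryDirectory, removeFile)
import System.FilePath ((</>))
import System.IO (IOMode (WriteMode), hClose, hPutStrLn, openFile)

-- Hypothetical helper: the Handle is registered with ResourceT, so hClose
-- runs when runResourceT finishes, even if the body throws an exception.
writeGreeting :: FilePath -> IO ()
writeGreeting path = runResourceT $ do
    (_releaseKey, h) <- allocate (openFile path WriteMode) hClose
    liftIO (hPutStrLn h "Hello!")

main :: IO ()
main = do
    dir <- getTemporaryDirectory
    let path = dir </> "resourcet-demo.txt"
    writeGreeting path
    readFile path >>= putStr
    removeFile path
```

For a single file, withFile from base gives a similar guarantee; ResourceT becomes more useful when several resources with different lifetimes are involved.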

The second reason that anonymous files are interesting is that they’re not always accidents! Sometimes this is what you want. This is the case for the Wayland example described earlier. A Wayland client creates a file to store its graphics and then sends the file descriptor to the Wayland server. That file now functions as a shared memory region that the client writes to and the server reads from.

There is no reason for such a file to ever be hard linked!

A common pattern for Wayland applications is to do what I did in the silly example code earlier: Create a file and then immediately delete it. But this is unnecessary, because we can do better.

How? See part two: