Type Classes

memfd: An example of Haskell and C

typeclasses.substack.com

Today I'm talking about my 'memfd' library, a concise example of FFI in Haskell.

Chris Martin
Mar 17
This article is installment two of two. Part one laid out some important concepts, and this part goes into the inner workings of the memfd library to show how Haskell FFI works.

I had been reading Drew DeVault’s free book, The Wayland Protocol (which I recommend if you’re interested in Wayland). When it comes time to show how to allocate a shared memory buffer for writing frames to be read by the Wayland server, the book recommends copying the following code:

static void randname(char *buf) {
  struct timespec ts;
  clock_gettime(CLOCK_REALTIME, &ts);
  long r = ts.tv_nsec;
  for (int i = 0; i < 6; ++i) {
    buf[i] = 'A'+(r&15)+(r&16)*2;
    r >>= 5;
  }
}

static int create_shm_file(void) {
  int retries = 100;
  do {
    char name[] = "/wl_shm-XXXXXX";
    randname(name + sizeof(name) - 7);
    --retries;
    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
      shm_unlink(name);
      return fd;
    }
  } while (retries > 0 && errno == EEXIST);
  return -1;
}

This does what I demonstrated in part one: It creates a new file (shm_open) and then immediately deletes it (shm_unlink). This is because the author wants a file but not a hard link, and there is traditionally no way to create a file without also creating a hard link.

This snippet is painful because the majority of the code is dedicated to generating a file name. The file name doesn’t matter at all, except that it can’t be the name of an existing file. So this randname function comes up with an arbitrary file name possibility based on the current system time. On the wild chance that this results in the name of a file that already exists, the create_shm_file function above will try the process repeatedly if needed, up to a maximum of 100 times before giving up.

This is a lot of work to put into coming up with a name for something that doesn’t matter and will only exist for a microsecond. The less roundabout way to create a file without a hard link is to use the memfd_create function, which does just that:

memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on. However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released.

Imagine my joy at discovering that exactly what I need is already there. But this API was in C, and I was using Haskell!

This is fine; Haskell and C are good pals.

C to Haskell

import Foreign.C.Types (CInt (..), CUInt (..))
import Foreign.C.String (CString)
import System.Posix.Types (Fd (..))

This is going to be an extremely simple integration, because there is only one C function that I needed. The manual gives this function’s type as:

int memfd_create(const char *name, unsigned int flags);

(You might be wondering why there is a name parameter, since the whole point of using memfd_create is that I didn’t want to have to name the file. This name is not the same as a file name; it only appears in various debugging outputs and otherwise has no effect on the behavior.)

The first thing I did is declare a function in Haskell that corresponds closely to the C function. The Haskell type signature looks like this:

c_create :: CString -> CreateFlags -> IO Fd

  1. The C function returns int; the Haskell function will return Fd. These are exactly the same thing, because Fd is a newtype for CInt, which is the Haskell type corresponding to C’s int.

  2. I’ve chosen to shorten the function name from “memfd_create” to “create” because in Haskell we have slightly better namespace facilities and less need to give functions globally unambiguous names. I have, however, prefixed this name with “c_” to remind us that it’s the C binding (and not what most Haskell code will use directly).

  3. The C function’s first parameter has type char* and the first Haskell parameter is CString (a type alias for Ptr CChar). This is exactly what char* means; in Haskell we just prefer to use words more often instead of symbols, so we write “Ptr” instead of “*”.[1]

  4. I defined the CreateFlags type as a newtype for CUInt, the Haskell word for C’s unsigned int.

newtype CreateFlags = CreateFlags CUInt

The crucial parts of the Haskell type signature I’ve written are almost exactly the same as the C function’s. The only differences are that I’ve used types with more specific meanings: Fd instead of CInt, and CreateFlags instead of CUInt. This is permitted because Fd and CreateFlags are newtypes, which means they are representationally equivalent to CInt and CUInt. An Fd is only nominally different from a C int; at runtime, they are the same thing.
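To make “representationally equivalent” concrete, here is a minimal sketch using coerce from Data.Coerce, which converts between a newtype and the type it wraps without doing any work at runtime. The names stdoutFd and rawValue are made up for this illustration and are not part of the memfd library.

```haskell
import Data.Coerce (coerce)
import Foreign.C.Types (CInt)
import System.Posix.Types (Fd (..))

-- Illustrative names only; not part of memfd.
-- Fd is a newtype over CInt, so coerce converts in either direction
-- and compiles down to nothing at runtime.
stdoutFd :: Fd
stdoutFd = coerce (1 :: CInt)

rawValue :: CInt
rawValue = coerce stdoutFd

main :: IO ()
main = do
    print stdoutFd
    print rawValue
```

Note that coerce only type-checks here because the Fd constructor is imported; the equivalence is something the module's author chooses to expose.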

I don’t actually write a definition for c_create; instead I declare it as a Haskell binding to the C function. This is not a type of syntax you see every day, but it works out of the box in GHC with no language extension flags, because the foreign function interface is part of the Haskell 2010 standard and of the GHC2021 language set.[2]

The binding declaration looks like this:

foreign import ccall unsafe "memfd_create"
    c_create :: CString -> CreateFlags -> IO Fd

  • “foreign import” means I’m declaring a type for a function that comes from another language.

  • “ccall” means this is a “C call”; this is how I specify that the foreign language is C.

  • “unsafe” is added here for a slight performance benefit. It is appropriate because memfd_create returns quickly and never calls back into Haskell.[3]

  • “memfd_create” specifies which C function I want to use.

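One caveat: with ccall, the compiler does not check the type signature I’ve given for c_create against the C declaration of memfd_create. It trusts what I wrote, so it is up to me to make sure the Haskell type matches the one in the manual. (The capi calling convention, enabled by the CApiFFI extension, instead makes the call through the C header, so that the C compiler sees the real prototype.)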

Technically I could stop here and the c_create function would be usable, just not very ergonomic. The remainder of the work is packaging it up nicely into a library that lets us use it in a nice Haskelly way.
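To see what “usable, just not very ergonomic” means in practice, here is a minimal sketch of a complete program that calls the raw binding directly. It assumes Linux with a C library that exports memfd_create (glibc 2.27 or later). To stay self-contained it passes a plain CUInt for the flags rather than the CreateFlags newtype, and both the helper rawCreate and the name "demo" are arbitrary choices for the example.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CInt (..), CUInt (..))
import System.Posix.Types (Fd (..))

-- Same binding as above, but with a bare CUInt for the flags so that
-- this sketch does not depend on any other definitions.
foreign import ccall unsafe "memfd_create"
    c_create :: CString -> CUInt -> IO Fd

-- Hypothetical helper: the caller has to marshal the name into a
-- C string and pass the flags as a raw number (0 means no flags).
rawCreate :: IO Fd
rawCreate = withCString "demo" $ \name -> c_create name 0

main :: IO ()
main = do
    fd <- rawCreate
    -- memfd_create returns -1 on failure; nothing here checks for that.
    print fd
```

Everything a Haskell user would rather not think about is on display: string marshalling, a magic number for the flags, and a sentinel return value.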

A polished Haskell API

The problem with c_create is that it requires a lot more effort to use than a Haskell user expects.

  • The CString type is a pointer to a character array. To get a CString, we have to explicitly allocate memory, and then we have to make sure that memory is freed when we’re done with it. Memory management is nothing to a C programmer, but in Haskell it’s the sort of thing we’d prefer to think about as little as possible.

  • What the heck does this “flags” parameter mean, and why is it an integer?

Here’s the Haskell API I really wanted (which you can see at the top of the API documentation):

create :: CreateOptions -> IO Fd

The two C parameters name and flags I’ve combined into one parameter of a type I’ve called CreateOptions. This is a pretty common thing to do when translating an API from some other language into Haskell. Rather than pass around multiple parameters, I’d rather bundle them together into a record. One of the fields in this record is going to be the name:

data CreateOptions = CreateOptions
    { name :: Name
    , ... -- more options will be added later
    }

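-- "deriving newtype" needs the DerivingStrategies and
-- GeneralizedNewtypeDeriving extensions; IsString comes from Data.String.
newtype Name = NameString { nameString :: String }
    deriving newtype IsString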

More options will be added later, but for now assume that CreateOptions is going to contain enough information that we’ll be able to extract CreateFlags from it.[4]

createOptionsFlags :: CreateOptions -> CreateFlags

The create function, then — the focal point of the memfd package — will be defined as follows:

import Foreign.C.String (withCString)

create :: CreateOptions -> IO Fd
create x =
    withCString (nameString (name x)) $ \name' ->
        c_create name' (createOptionsFlags x)

withCString is one way to convert a Haskell string to a C string. This is a continuation-passing style function that allocates a CString, passes it to the continuation, then frees the CString. This is sufficient to handle the simple memory management needed here.
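One thing the definition above leaves out is error handling: memfd_create returns -1 and sets errno when it fails. The following is only a sketch of one way to surface that in Haskell, not necessarily how the memfd package does it. It uses throwErrnoIfMinus1 from Foreign.C.Error in base; the function createChecked and its raw CUInt flags parameter are hypothetical simplifications that keep the example self-contained. It assumes Linux with glibc 2.27 or later.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Error (throwErrnoIfMinus1)
import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CInt (..), CUInt (..))
import System.Posix.Types (Fd (..))

foreign import ccall unsafe "memfd_create"
    c_create :: CString -> CUInt -> IO Fd

-- Hypothetical variant of create: takes the name and raw flags directly
-- and throws an IOError built from errno if the C call returns -1.
createChecked :: String -> CUInt -> IO Fd
createChecked name flags =
    withCString name $ \name' ->
        throwErrnoIfMinus1 "memfd_create" (c_create name' flags)

main :: IO ()
main = do
    fd <- createChecked "demo" 0
    putStrLn ("created file descriptor " ++ show fd)
```

With this shape, a caller never sees -1; a failure arrives as an ordinary Haskell exception that can be caught with the usual tools.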

The remainder of the work pertains to the flags parameter. There is more work here than it may seem.

Flags

The following are excerpts from the documentation. It’s fine if you don’t understand what most of it is talking about; I don’t either. (I use Linux but I’m not really an expert.)

The following values may be bitwise ORed in flags to change the behavior of memfd_create():

  • MFD_CLOEXEC — Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.

  • MFD_ALLOW_SEALING — Allow sealing operations on this file. See the discussion of the F_ADD_SEALS and F_GET_SEALS operations in fcntl(2), and also NOTES, below.

  • MFD_HUGETLB — The anonymous file will be created in the hugetlbfs filesystem using huge pages.

  • MFD_HUGE_2MB, MFD_HUGE_1GB, ... — Used in conjunction with MFD_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB, 1 GB, ...) on systems that support multiple hugetlb page sizes. Definitions for known huge page sizes are included in the header file <linux/memfd.h>.

Unused bits in flags must be 0.

Take a look at that <linux/memfd.h> C header file. The most relevant parts are:

#define MFD_CLOEXEC       0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#define MFD_HUGETLB       0x0004U

In C culture, flags are how we compactly represent a set of yes/no options. While I was memorizing powers of two in various representations in intro CS, my friends in the industrial design school were practicing drawing circles; all disciplines have some basic mundane skills that become rote before you can be fast at more interesting skills. A native Haskell programmer might not have their powers of two and hexadecimal conversions memorized, just like I can’t quickly draw a precise circle, but a C programmer probably does, and their eye immediately interprets the above list of definitions in binary:

[Table: the flag definitions above, with each value written out in binary.] The last two entries in this table are not discussed in this article and are not yet included in my memfd library, because they did not exist when I published the package.
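The binary reading can also be reproduced in code. This small sketch covers only the three flags quoted from the header above; lowByte is a made-up helper for the illustration, not a library function.

```haskell
import Data.Bits (testBit)
import Foreign.C.Types (CUInt)

-- The three flag values quoted from <linux/memfd.h> above.
flagValues :: [(String, CUInt)]
flagValues =
    [ ("MFD_CLOEXEC",       0x0001)
    , ("MFD_ALLOW_SEALING", 0x0002)
    , ("MFD_HUGETLB",       0x0004)
    ]

-- Render the low eight bits of a value, most significant bit first.
-- (lowByte is an illustrative helper, not a library function.)
lowByte :: CUInt -> String
lowByte n = [ if testBit n i then '1' else '0' | i <- [7, 6 .. 0] ]

main :: IO ()
main = mapM_ (\(name, value) -> putStrLn (lowByte value ++ "  " ++ name)) flagValues
```

Running it prints 00000001, 00000010 and 00000100 next to the three names: each flag occupies exactly one bit, in adjacent positions.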
[Image: screenshot from The Matrix (1999). Cypher describes to Neo how, when he looks at screens full of encoded Matrix information, what he sees is what the code represents, not the code itself.]
“I don’t even see the code anymore…” Two programmers looking at the same code often see something different in it, depending on their own practiced habits and conventions.

In this light you can see that each of the five flags in the table above corresponds to a 1 bit at a particular location. There are then two important notes in the documentation, which are conventions that a C programmer would just assume even if they weren’t explicitly written:

  • “Values may be bitwise ORed.”

  • “Unused bits in flags must be 0.”

As a Haskell author, I look at those two statements and immediately see a monoid.

[Image: screenshot from The Matrix (1999). Morpheus says “look again” to Neo, and the simulated person he saw a moment ago has turned into Smith, an agent of the enemy.]
“Look again.”

import Data.Bits ((.|.))

instance Semigroup CreateFlags where
    CreateFlags x <> CreateFlags y = CreateFlags (x .|. y)

instance Monoid CreateFlags where
    mempty = CreateFlags 0

Monoid and Semigroup constitute the Haskelly way to say that some type has a zero value and a way to combine values.
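Here is a short, self-contained sketch of what those two instances buy us. The flag values are copied from the header excerpt above, and flagWhen is a hypothetical helper invented for the example.

```haskell
import Data.Bits ((.|.))
import Foreign.C.Types (CUInt)

newtype CreateFlags = CreateFlags CUInt
    deriving (Eq, Show)

instance Semigroup CreateFlags where
    CreateFlags x <> CreateFlags y = CreateFlags (x .|. y)

instance Monoid CreateFlags where
    mempty = CreateFlags 0

-- Values copied from the header excerpt above.
closeOnExec, allowSealing :: CreateFlags
closeOnExec  = CreateFlags 0x0001
allowSealing = CreateFlags 0x0002

-- Hypothetical helper: turn on a flag only when a condition holds.
flagWhen :: Bool -> CreateFlags -> CreateFlags
flagWhen condition flag = if condition then flag else mempty

main :: IO ()
main = do
    print (closeOnExec <> allowSealing)
    print (mconcat [flagWhen True closeOnExec, flagWhen False allowSealing])
```

The operator <> plays the role of C's bitwise OR, and mempty plays the role of passing 0, so “no flags” and “these flags combined” need no special cases.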

Constants

Next I have a very boring job, which is to define Haskell constants corresponding to each of the flags from the C header file. I could write it this way, but I won’t[5]:

closeOnExec   = CreateFlags 0x00000001
allowSealing  = CreateFlags 0x00000002
hugeTLB       = CreateFlags 0x00000004
hugeTLB_64KB  = CreateFlags 0x40000000
hugeTLB_512KB = CreateFlags 0x4c000000
-- hugeTLB_1MB = ... (and so on)

Instead of manually copying the numbers from <linux/memfd.h>, I’ll let Haskell’s build system do it for me.

#include <linux/memfd.h>

closeOnExec   = CreateFlags (#const MFD_CLOEXEC)
allowSealing  = CreateFlags (#const MFD_ALLOW_SEALING)
hugeTLB       = CreateFlags (#const MFD_HUGETLB)
hugeTLB_64KB  = CreateFlags (#const MFD_HUGE_64KB)
hugeTLB_512KB = CreateFlags (#const MFD_HUGE_512KB)
-- etc.

To enable the #include and #const directives, which are not part of the ordinary Haskell language, this code needs to be in a file with a .hsc extension rather than an ordinary Haskell (.hs) file. The build system will automatically run the hsc2hs tool on it, which will look at the <linux/memfd.h> C header file and substitute the appropriate constants, e.g. replacing #const MFD_CLOEXEC with 1 and #const MFD_ALLOW_SEALING with 2. This tool, which essentially generates Haskell code from C header files, is a critical component of Haskell’s good relationship with C.[6]

Alongside each of these constants in my Haskell library, I also included documentation, much of which is copied from the original Linux documentation. This is a really important step that you should always remember when you’re writing any kind of translation layer. Your audience (in this case, Haskell coders) does not necessarily know the underlying language (C) well enough to read the original documentation, understand the Linux concepts, and infer how the parts of your library are connected to a Linux API that maybe they’ve never even heard of before. Your library should ideally contain enough comments to be understood on its own.

Guiding users toward success

The design of the memfd library then called for one more big step to get it cleaned up and usable. If you recall, I haven’t yet given a complete definition of the CreateOptions type. It is given below, along with the ancillary types it is built from.

data CreateOptions = CreateOptions
    { name :: Name
    , onExec :: OnExec
    , sealing :: Sealing
    , fileSystem :: FileSystem
    }

data OnExec = CloseOnExec | RemainOpenOnExec

data Sealing = AllowSealing | DoNotAllowSealing

data FileSystem = TemporaryFileSystem | HugeTLBFileSystem HugeTLBOptions

data HugeTLBOptions = DefaultHugeTLB | HugeTLBSize HugeTLBSize

data HugeTLBSize = HugeTLB_64KB | HugeTLB_512KB | HugeTLB_1MB | ...

Rather than simply have the user specify a set of Boolean flags, like the C function does, I’ve provided here a more structured format in which the tree of choices is more explicitly laid out. Each type and each constructor are places where documentation can be attached and conveniently found.[7]

The OnExec type corresponds to whether the C MFD_CLOEXEC flag is set. A value of CloseOnExec means the flag will be set, and RemainOpenOnExec means it will not. Rather than “True” and “False,” these constructor names “close on exec” and “remain open on exec” more explicitly state what each choice means. Likewise, Sealing corresponds to whether the MFD_ALLOW_SEALING flag is set.

The FileSystem type is where the Haskell-style design more seriously departs from the C style. This is because I have a complaint about how my C friends do things: Not all possible combinations of flags make sense! The memfd_create function gives us a choice of file system. By default, the file is created using tmpfs, and if the MFD_HUGETLB flag is set, then hugetlbfs is used instead. If hugetlbfs is used, then there is a further option: We may set one of the MFD_HUGE_<size> flags to specify the HugeTLB page size.[8]

I find this state of affairs rather upsetting, because there’s nothing stopping me from setting nonsensical combinations of flags, e.g. giving one of the MFD_HUGE_<size> flags without enabling MFD_HUGETLB. I’m not even entirely sure what would happen if I did that. An exhaustive reading of the memfd_create manual page is required to know what all the options are and how they interact.

These areas of ambiguity do not exist in the Haskell design, and with the Haskell version I find that it is much easier to see that I have considered each available choice. The TemporaryFileSystem constructor has no fields, which communicates unambiguously that if we use tmpfs there are no additional options. The HugeTLBFileSystem constructor has a field of type HugeTLBOptions, which again very clearly tells a user that further choices are available when hugetlbfs is in use.
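To illustrate how the structured types rule out the nonsensical combinations, here is a sketch of how options like these could be folded into flags. It is not the library's actual conversion function: the helper names are invented, the size list is cut down to two entries, and the numeric values are hard-coded from the constants shown earlier instead of being pulled in with #const.

```haskell
import Data.Bits ((.|.))
import Foreign.C.Types (CUInt)

newtype CreateFlags = CreateFlags CUInt
    deriving (Eq, Show)

instance Semigroup CreateFlags where
    CreateFlags x <> CreateFlags y = CreateFlags (x .|. y)

instance Monoid CreateFlags where
    mempty = CreateFlags 0

data OnExec = CloseOnExec | RemainOpenOnExec
data Sealing = AllowSealing | DoNotAllowSealing
data FileSystem = TemporaryFileSystem | HugeTLBFileSystem HugeTLBOptions
data HugeTLBOptions = DefaultHugeTLB | HugeTLBSize HugeTLBSize
data HugeTLBSize = HugeTLB_64KB | HugeTLB_512KB  -- list cut short for the sketch

onExecFlags :: OnExec -> CreateFlags
onExecFlags CloseOnExec      = CreateFlags 0x0001  -- MFD_CLOEXEC
onExecFlags RemainOpenOnExec = mempty

sealingFlags :: Sealing -> CreateFlags
sealingFlags AllowSealing      = CreateFlags 0x0002  -- MFD_ALLOW_SEALING
sealingFlags DoNotAllowSealing = mempty

-- A page-size flag can only ever be produced together with MFD_HUGETLB.
fileSystemFlags :: FileSystem -> CreateFlags
fileSystemFlags TemporaryFileSystem   = mempty
fileSystemFlags (HugeTLBFileSystem o) = CreateFlags 0x0004 <> hugeTLBFlags o

hugeTLBFlags :: HugeTLBOptions -> CreateFlags
hugeTLBFlags DefaultHugeTLB              = mempty
hugeTLBFlags (HugeTLBSize HugeTLB_64KB)  = CreateFlags 0x40000000
hugeTLBFlags (HugeTLBSize HugeTLB_512KB) = CreateFlags 0x4c000000

-- Hypothetical top-level fold over the three independent choices.
optionFlags :: OnExec -> Sealing -> FileSystem -> CreateFlags
optionFlags e s f = onExecFlags e <> sealingFlags s <> fileSystemFlags f

main :: IO ()
main = print (optionFlags CloseOnExec AllowSealing (HugeTLBFileSystem DefaultHugeTLB))
```

Because a HugeTLBSize value can only be reached through the HugeTLBFileSystem constructor, there is simply no way to ask for a page size without also getting the MFD_HUGETLB bit.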

It is important to note that the Haskell approach of designing a single type to represent all the options does not force us to sacrifice convenient defaults. My library provides a function to construct CreateOptions that are the same as what the C library gives you if you don’t specify any flags.

defaultCreateOptions :: Name -> CreateOptions
defaultCreateOptions x = CreateOptions
    { name = x
    , onExec = RemainOpenOnExec
    , sealing = DoNotAllowSealing
    , fileSystem = TemporaryFileSystem
    }

So if a C program contained a line like this:

int fd = memfd_create("wayland-buffers", 0);

The equivalent line in a Haskell program would look like this:

fd <- create (defaultCreateOptions "wayland-buffers")
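Defaults also combine well with record update syntax when only some options differ. The sketch below mirrors the types from this article in simplified form (the huge-page options are left out, and the IsString instance is written by hand) so that it compiles on its own; waylandOptions is a made-up name. It corresponds roughly to passing MFD_CLOEXEC | MFD_ALLOW_SEALING in C. The string literal is accepted as a Name because of the OverloadedStrings extension.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))

newtype Name = NameString { nameString :: String }
    deriving (Eq, Show)

instance IsString Name where
    fromString = NameString

data OnExec = CloseOnExec | RemainOpenOnExec
    deriving (Eq, Show)

data Sealing = AllowSealing | DoNotAllowSealing
    deriving (Eq, Show)

-- Simplified for the sketch: the huge-page options are left out.
data FileSystem = TemporaryFileSystem | HugeTLBFileSystem
    deriving (Eq, Show)

data CreateOptions = CreateOptions
    { name :: Name
    , onExec :: OnExec
    , sealing :: Sealing
    , fileSystem :: FileSystem
    }
    deriving (Eq, Show)

defaultCreateOptions :: Name -> CreateOptions
defaultCreateOptions x = CreateOptions
    { name = x
    , onExec = RemainOpenOnExec
    , sealing = DoNotAllowSealing
    , fileSystem = TemporaryFileSystem
    }

-- Start from the defaults and override only the fields that differ.
waylandOptions :: CreateOptions
waylandOptions =
    (defaultCreateOptions "wayland-buffers")
        { onExec = CloseOnExec
        , sealing = AllowSealing
        }

main :: IO ()
main = print waylandOptions
```

Fields that are not mentioned in the update keep their default values, so the call site lists only the decisions that were actually made.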

So there you have it! The foreign import of the C function itself was quite simple. The fun creative part (which wasn’t strictly necessary, but gives a safer, easier-to-understand package) was converting the conventional C flag-based parameter into Haskell’s algebraic datatypes.

I’ve only scratched the surface of Haskell’s foreign function interface here, because this was an incredibly simple interface to cover. Things do get a little more complicated when we start allocating C structs, thinking harder about memory management, and passing callbacks to C functions. Haskell provides tools for these things as well: for mapping Haskell datatypes to C structs, and for registering memory-freeing actions so that Haskell’s garbage collector can trigger deallocation of memory that was allocated by a C library. But this is enough for today!

[1] Well, “Ptr” is almost a word, anyway. It stands for “pointer.” A lot of the older Haskell APIs abbreviate heavily. Please don’t name things like this anymore. Spell out the entire word.

[2] In the original release of this article, I erroneously stated that the CApiFFI extension is required; it is not.

[3] See Foreign imports and multi-threading in the GHC user manual for details on whether a foreign call should be marked as safe or unsafe.

[4] The function that converts CreateOptions to CreateFlags can be found here.

[5] It doesn’t matter, but if you’re curious: The hugeTLB constants here can be determined by looking at hugetlb_encode.h (which is imported by memfd.h) and doing some math.

[6] If you look in the .hsc file you will also notice a number of #ifdef conditionals. These are used to make the Haskell library adaptable to multiple versions of the Linux memfd API; it has changed over time, and depending on which version of Linux you’re building on, not all of the flags may be available. This was a bit too tedious to discuss in this article.

[7] I have not written as much documentation as I would like to have, because as I said, I am not enough of a Linux expert to write it, but the design at least leaves these as obvious places where the documentation ought to go.

[8] Again, I’m sorry, I have very little understanding of the Linux details, and I cannot go into any further detail on what exactly HugeTLB is for because it’s just too far outside my knowledge. All I’ve done here is read just enough to be able to translate this API into Haskell.

© 2023 Mission Valley Software LLC