Building filesystem servers in user-land for protocols like AFS or NFSv4

A friend of mine who dabbles (to put it lightly) in AFS was telling me this weekend about a problem he’d had to contend with.

AFS, like NFSv4, can multiplex multiple users on the client side onto one connection, but unlike NFS servers, AFS servers [OpenAFS] are typically built to run in user-land. Because of this and because of differences in filesystem semantics between POSIX (which NFS approximates) and AFS, AFS servers typically implement not only the server but the whole filesystem in user-land, completely avoiding the underlying OS’ own filesystems.

So one cannot normally share non-AFS filesystems with AFS, and one cannot use AFS filesystems except through the AFS protocol. But my friend Jeff Hutzelman needed to do just that, so he built a prototype to demonstrate that it was possible. [UPDATE: an HTTP URL for hostafs http://www.cs.cmu.edu/afs/cs.cmu.edu/project/systems-jhutz/hostafs/]

This reminded me of what used to be a pet project of mine for Solaris that has not panned out. There is a class of applications for which a feature of Windows and some other OSes, per-thread credentials, can be very useful: namely, multi-threaded server applications running in user-land that need to perform certain operations as though they were performed by a client user process, not the server process. File servers like OpenAFS are a good example of this. Now, OpenAFS doesn’t actually need such a feature: its filesystem is implemented in user-land as well, so there’s no need for the kernel to provide the server with such a facility.

Well, my team had what I thought was a use case for this facility in Solaris, and so I tried to get a project going to add per-thread credentials support to Solaris. Long story short: there are too many little and big semantics and backwards compatibility problems to resolve, so we had to find an alternative solution. Of course, alternatives existed, and we settled for one: fork() separate worker processes, one per-client user, from a privileged one, adopt the credentials of the client users in the child processes, and run the operations in question in the worker processes.
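As a rough illustration of that pattern, here is a minimal sketch using only plain POSIX calls. It is not code from the prototype: run_as_user() and check_uid() are hypothetical names, and it forks once per operation for brevity, whereas the design described above keeps one long-lived worker per client user.

```c
#include <assert.h>
#include <grp.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the prototype): fork a child, switch it
 * to uid/gid, run work(arg) there, and return the child's exit status
 * (or -1 if the child could not be created or did not exit normally).
 */
int
run_as_user(uid_t uid, gid_t gid, int (*work)(void *), void *arg)
{
        pid_t pid;
        int status;

        if ((pid = fork()) < 0)
                return (-1);
        if (pid == 0) {
                /* Order matters: supplementary groups, then gid, uid last. */
                if (geteuid() == 0 && setgroups(1, &gid) < 0)
                        _exit(126);
                if (setgid(gid) < 0 || setuid(uid) < 0)
                        _exit(126);
                /* Verify that the switch really took effect. */
                if (getuid() != uid || geteuid() != uid)
                        _exit(126);
                _exit(work(arg) & 0x7f);
        }
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
                return (-1);
        return (WEXITSTATUS(status));
}

/* Example operation: exits 0 if the worker runs as the expected uid. */
int
check_uid(void *arg)
{
        return (getuid() == *(uid_t *)arg ? 0 : 1);
}
```

An unprivileged caller can only "switch" to its own uid and gid, so this is mostly useful from a privileged parent, which is exactly the arrangement described above.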

Now, I’m blogging this a) because I think the problems we ran into with per-thread credentials are interesting, and b) to explore the alternatives and design patterns.

I may blog about (a) in-depth in the future. For now suffice it to say that the problems range from “what credentials will signal handlers, shared object .init sections, etc… run with?” to “what happens if a library, unbeknownst to the application, creates threads from a thread running with different credentials than the process itself, but the library also doesn’t know about per-thread credentials?”

Some of these have relatively simple answers; others cannot be resolved with ease, or at all. When you mix multiple security contexts in one VM address space you have to be very careful. In particular you have to make sure that every component of the environment knows about this and follows agreed-upon protocols. Multi-user OS kernels like Solaris and Linux are examples of programs that work like that (but then, they’re not user-land programs).

But the world of multi-threaded Unix user-land has always assumed a single credential used by all threads in the same process (setuid(2) affects all threads in the caller’s process, though it does not impact in-progress system calls), and violating this assumption could cause lots of security bugs when using code written when that assumption was valid; the system is too open and too large to make it safe to violate this assumption (except for the simplest of programs). As the enormity of the problem of making this safe for the use case that we had in mind sank in we simply had to settle for more orthodox alternatives.

As for (b), I’ve written a small prototype of a library that implements: a worker process abstraction and variants of system calls that take an argument specifying a user to run the system call as (actually, the prototype does this only for open(2): doas_open()).

Additionally, for the fun of it and to explore some of the problems in (a), I have an LD_PRELOADable shared object that implements an illusion of per-thread credentials.

This sort of facility (e.g., doas_open()) could be very useful in building multi-threaded fileservers for protocols like AFS and NFSv4, running in user-land but sharing filesystems implemented by the OS, rather than filesystems implemented in the same user-land program: such programs need to be able to run system calls like open(2) with the credentials that correspond to the clients on behalf of which the server calls open(2). This prototype provides just that, albeit using fork()ed worker processes under the hood (but the IPC mechanism used is doors, which is very fast). (Note that I/O system calls like read(2), write(2), getmsg(2), putmsg(2), recvmsg(2), pread(2), pwrite(2), and so on don’t need to be proxied through worker processes: what matters is the credential used to create the file descriptor in question, not the credentials of the caller of these I/O calls.)
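To make the mechanics concrete, here is a sketch of the same idea on systems without doors. It swaps in a different technique, a Unix-domain socketpair with SCM_RIGHTS descriptor passing, and all three function names (send_fd(), recv_fd(), open_via_worker()) are made up for illustration. The worker does the open(2) and hands the descriptor back; the caller then does its own I/O on it.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Control-message buffer big enough for exactly one descriptor. */
union fd_cmsg {
        struct cmsghdr hdr;             /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof (int))];
};

/* Send fd over sock, or just a status byte when fd < 0 (open failed). */
int
send_fd(int sock, int fd)
{
        char failed = (fd < 0);
        struct iovec iov = { .iov_base = &failed, .iov_len = 1 };
        union fd_cmsg u;
        struct msghdr msg;
        struct cmsghdr *c;

        (void) memset(&msg, 0, sizeof (msg));
        (void) memset(&u, 0, sizeof (u));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        if (fd >= 0) {
                msg.msg_control = u.buf;
                msg.msg_controllen = sizeof (u.buf);
                c = CMSG_FIRSTHDR(&msg);
                c->cmsg_level = SOL_SOCKET;
                c->cmsg_type = SCM_RIGHTS;
                c->cmsg_len = CMSG_LEN(sizeof (int));
                (void) memcpy(CMSG_DATA(c), &fd, sizeof (int));
        }
        return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Receive one descriptor from sock; returns -1 if none was passed. */
int
recv_fd(int sock)
{
        char failed;
        struct iovec iov = { .iov_base = &failed, .iov_len = 1 };
        union fd_cmsg u;
        struct msghdr msg;
        struct cmsghdr *c;
        int fd = -1;

        (void) memset(&msg, 0, sizeof (msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof (u.buf);
        if (recvmsg(sock, &msg, 0) != 1)
                return (-1);
        c = CMSG_FIRSTHDR(&msg);
        if (c != NULL && c->cmsg_level == SOL_SOCKET &&
            c->cmsg_type == SCM_RIGHTS && c->cmsg_len == CMSG_LEN(sizeof (int)))
                (void) memcpy(&fd, CMSG_DATA(c), sizeof (int));
        return (fd);
}

/* Hypothetical stand-in for doas_open(): a forked worker opens the file. */
int
open_via_worker(const char *path, int oflag)
{
        int sv[2], fd, status;
        pid_t pid;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                return (-1);
        if ((pid = fork()) < 0) {
                (void) close(sv[0]);
                (void) close(sv[1]);
                return (-1);
        }
        if (pid == 0) {
                /* Worker: a real one would adopt the client's credentials here. */
                (void) close(sv[0]);
                fd = open(path, oflag);
                _exit(send_fd(sv[1], fd) == 0 ? 0 : 1);
        }
        (void) close(sv[1]);
        fd = recv_fd(sv[0]);
        (void) close(sv[0]);
        (void) waitpid(pid, &status, 0);
        return (fd);
}
```

The caller can read(2) from the returned descriptor directly, with no further help from the worker. That is the point of the parenthetical note above: the access check happened at open(2) time, in the worker.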

One of the limitations of my friend’s prototype AFS server is that it can’t perform local filesystem operations as the clients. Perhaps my prototype can help.

The library consists of two main functions:

  • doas_alloc(), which returns a handle to a worker process and takes a description of the user’s credentials as an argument
  • doas_free(), which releases handles created by doas_alloc()

and then doas_open(), which is just like open(2), but augmented with a worker process handle argument.

Actually, the handle represents a user credential, not a worker process — that a worker process is used is an implementation detail.

The interface for the per-thread credential emulator (which requires interposing on open(2) and friends) consists of a single additional function: doas() which takes a handle and a callback function and data arguments — calls to open(2) and friends in the callback function and all functions it calls in the same thread will be changed to calls to doas_open() with the handle passed to doas().
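One plausible way to get this behaviour is a thread-local "current subject" that doas() sets for the duration of the callback. The sketch below is an assumption about how such an emulator could work, not the prototype's actual code; DOAS_NONE and record_current() are hypothetical, and __thread is a compiler extension (supported by GCC, Clang and Sun Studio) rather than standard C.

```c
#include <assert.h>

#define DOAS_NONE       (-1)    /* hypothetical: "use the process credentials" */

/* One value per thread, so concurrent doas() calls do not interfere. */
static __thread int current_idx = DOAS_NONE;

/* Subject index that interposers in this thread should proxy to. */
int
doas_current(void)
{
        return (current_idx);
}

/*
 * Run doas_func(doas_data) with idx as this thread's current subject.
 * The previous value is restored afterwards, so calls can nest.
 */
int
doas(int idx, int (*doas_func)(void *), void *doas_data)
{
        int saved = current_idx;
        int ret;

        current_idx = idx;
        ret = doas_func(doas_data);
        current_idx = saved;
        return (ret);
}

/* Example callback: records which subject it ran under. */
int
record_current(void *out)
{
        *(int *)out = doas_current();
        return (0);
}
```

An interposed open(2) would then check doas_current() and, when it is not DOAS_NONE, forward the call to doas_open() with that index; otherwise it would fall through to the real open(2).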

So there you have it. A framework for building function wrappers that take as an additional argument a “user impersonation token” (to borrow from Windows terminology), portable to any Unix that has IPC methods that allow for file descriptor passing. And a framework for emulating per-thread credentials, written with zero extensions to the underlying OS, portable to any Unix whose linker/run-time linker supports interposition.

The API looks like this:

int doas_open(int idx, char *fname, int oflag, mode_t mode);
/* get current doas subject index */
int  doas_current(void);
/*
 * get a new subject index for a new subject represented by the setup proc and
 * setup_data (which are called to setup a worker process' environment)
 */
int  doas_idx_new(int (*setup_proc)(void *), void *setup_data);
/* release a subject index */
void doas_done(int idx);
/*
 * arrange for interposers called by this same thread to proxy syscalls to the
 * worker process associated with the given subject index
 */
int  doas(int idx, int (*doas_func)(void *), void *doas_data);

It could be improved somewhat. For example, the handles shouldn’t be int but int64_t, and the library should make sure no handles are reused in the same process so it can detect dangling references to dead handles.
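A sketch of what that improvement might look like, under stated assumptions: handles come from a 64-bit counter that only ever goes up, so a freed handle can never match a live slot again. The names (handle_new(), handle_lookup(), handle_free(), MAX_SUBJECTS) are invented for this example, and a real version would need a mutex around the table.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SUBJECTS    16      /* hypothetical table size */

struct subject_slot {
        int64_t handle;         /* 0 = free, else the handle bound to this slot */
        int door_fd;            /* stand-in for the worker's door descriptor */
};

static struct subject_slot slots[MAX_SUBJECTS];
static int64_t next_handle = 1; /* monotonically increasing, never reused */

/* Bind door_fd to a fresh handle; returns -1 if the table is full. */
int64_t
handle_new(int door_fd)
{
        int i;

        for (i = 0; i < MAX_SUBJECTS; i++) {
                if (slots[i].handle == 0) {
                        slots[i].handle = next_handle++;
                        slots[i].door_fd = door_fd;
                        return (slots[i].handle);
                }
        }
        return (-1);
}

/* Returns the door descriptor, or -1 for a stale or unknown handle. */
int
handle_lookup(int64_t h)
{
        int i;

        if (h <= 0)
                return (-1);
        for (i = 0; i < MAX_SUBJECTS; i++) {
                if (slots[i].handle == h)
                        return (slots[i].door_fd);
        }
        return (-1);
}

/* Release a handle; returns -1 if it was already dead (dangling reference). */
int
handle_free(int64_t h)
{
        int i;

        if (h <= 0)
                return (-1);
        for (i = 0; i < MAX_SUBJECTS; i++) {
                if (slots[i].handle == h) {
                        slots[i].handle = 0;
                        slots[i].door_fd = -1;
                        return (0);
                }
        }
        return (-1);
}
```

With small int indices, a freed index is quickly handed out again and a stale copy silently refers to a different user's worker. Here the slot is reused but the handle value is not, so a stale handle fails the lookup.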

The code for forking worker processes is interesting to me because it might be a re-usable design. The parent process creates a door, forks a child and then the child creates its door and sends it to the parent via a door call on the parent’s door. That may seem like a complicated dance, but it isn’t really, and I believe that the asymmetry that leads to it is central to the doors IPC story.

The door server forker consists of:

  • A rendez-vous point — the function that fork()s the child has to wait for the child to call the parent’s door; the parent’s door server function will run in a separate thread, so a simple condition variable will do
  • A door server function in the parent
  • A function that sets up the rendez-vous, starts the parent’s door server and fork()s the child; the parent then waits for the child, while the child creates its own door, passes it back to the parent and enters door_return(3DOOR)
  • Setting the child worker process’ credentials is done through a callback in this prototype, but it could be done with an octet string token description of the credentials to use

The rendez-vous and the function that forks the workers look like this (pardon the printf()s):

#include <sys/types.h>
#include <door.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* ndoors, doors[] and doas_proc() are defined elsewhere in the library */

struct doas_fork_rendez_vous {
        pthread_cond_t cv;
        pthread_mutex_t lock;
        int idx;
        int dfd;                /* child's door, set by doas_fork_parent() */
        int this_dfd;           /* the parent's rendez-vous door */
};

/* the parent's door server function */
static void doas_fork_parent(void *, char *, size_t, door_desc_t *, uint_t);

static
int
fork_doas_door(int idx, int (*setup_proc)(void *), void *setup_data)
{
        pid_t pid;
        int dfd;
        struct doas_fork_rendez_vous rv;

        if (idx < 0 || idx >= ndoors)
                return (-1);
        if (pthread_cond_init(&rv.cv, NULL) != 0)
                return (-1);
        if (pthread_mutex_init(&rv.lock, NULL) != 0) {
                /* the mutex was never initialized; only the cv needs cleanup */
                (void) pthread_cond_destroy(&rv.cv);
                return (-1);
        }
        (void) pthread_mutex_lock(&rv.lock);
        rv.idx = idx;
        rv.dfd = -1;
        dfd = door_create(doas_fork_parent, (void *)&rv, DOOR_UNREF);
        rv.this_dfd = dfd;
        if (dfd < 0) {
                /* unlock before destroying: destroying a locked mutex is undefined */
                (void) pthread_mutex_unlock(&rv.lock);
                (void) pthread_cond_destroy(&rv.cv);
                (void) pthread_mutex_destroy(&rv.lock);
                return (-1);
        }
        if ((pid = fork()) < 0) {
                (void) pthread_mutex_unlock(&rv.lock);
                (void) pthread_cond_destroy(&rv.cv);
                (void) pthread_mutex_destroy(&rv.lock);
                (void) door_revoke(dfd);
                return (-1);
        }
        if (pid == 0) {
                /* child */
                int dfd2 = -1;
                int failed = 0;
                door_arg_t darg;
                door_desc_t dd;

                (void) memset(&darg, 0, sizeof (darg));
                darg.data_ptr = NULL;
                darg.data_size = 0;
                dfd2 = door_create(doas_proc, NULL, DOOR_UNREF);
                /*
                 * If setup_proc == NULL we're setting up a worker
                 * process with the same characteristics as the parent,
                 * e.g., so the parent can drop privileges but retain
                 * a door to this privileged worker process.
                 */
                if (setup_proc != NULL && setup_proc(setup_data) < 0) {
                        /* failure: call the parent's door with no descriptor */
                        failed = 1;
                } else {
                        dd.d_attributes = DOOR_DESCRIPTOR;
                        dd.d_data.d_desc.d_descriptor = dfd2;
                        darg.desc_ptr = &dd;
                        darg.desc_num = 1;
                }
                if (door_call(dfd, &darg) < 0) {
                        (void) door_revoke(dfd2);
                        exit(1);
                }
                if (failed) {
                        (void) door_revoke(dfd2);
                        exit(1);
                }
                /* XXX should rv.cv/rv.lock be cleaned up in the child?? */
                /* Service the door */
                (void) door_return(NULL, 0, NULL, 0);
                exit(1);        /* shouldn't happen */
                /* NOTREACHED */
        }
        /* parent -- wait for child to pass back its door */
        printf("Parent is going to sleep on cv\n");
        /*
         * XXX a single wait is vulnerable to spurious wakeups; a more
         * robust version would loop on a predicate set by doas_fork_parent()
         */
        (void) pthread_cond_wait(&rv.cv, &rv.lock);
        printf("Parent is back from sleeping on cv\n");
        /* we no longer need the door over which the child passed its door */
        (void) door_revoke(dfd);
        /* cleanup */
        (void) pthread_mutex_unlock(&rv.lock);
        (void) pthread_cond_destroy(&rv.cv);
        (void) pthread_mutex_destroy(&rv.lock);
        /* save the child's door (-1 if the child failed to set itself up) */
        return (doors[idx] = rv.dfd);
}

The door server function in the parent looks like this:

/* ARGSUSED */
static
void
doas_fork_parent(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_desc)
{
        struct doas_fork_rendez_vous *rv;

        /*
         * rv points into fork_doas_door()'s stack frame, so it is only
         * valid while that function is still running
         */
        rv = (struct doas_fork_rendez_vous *)cookie;
        if (arg_size == 0 && argp == DOOR_UNREF_DATA) {
                printf("doas_fork_parent() unref\n");
                (void) door_revoke(rv->this_dfd);
                (void) door_return(NULL, 0, NULL, 0);
                return;         /* only reached if door_return() fails */
        }
        /* record the child's door (if it sent one) and wake up the parent */
        printf("doas_fork_parent() here!\n");
        (void) pthread_mutex_lock(&rv->lock);
        if (n_desc == 1)
                rv->dfd = dp->d_data.d_desc.d_descriptor;
        (void) pthread_cond_broadcast(&rv->cv);
        (void) pthread_mutex_unlock(&rv->lock);
        printf("doas_fork_parent() cond_broadcast done...!\n");
        (void) door_return(NULL, 0, NULL, 0);
}
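
One way to harden a rendez-vous like this against spurious or early wakeups is to wait on an explicit predicate instead of on the condition variable alone. The following is a sketch, not the prototype's code: struct rendez_vous and the rv_*() helpers are hypothetical names, and the door-specific parts are left out.

```c
#include <assert.h>
#include <pthread.h>

struct rendez_vous {
        pthread_mutex_t lock;
        pthread_cond_t cv;
        int done;               /* predicate: has the other side checked in? */
        int dfd;                /* descriptor handed over, or -1 on failure */
};

int
rv_init(struct rendez_vous *rv)
{
        if (pthread_mutex_init(&rv->lock, NULL) != 0)
                return (-1);
        if (pthread_cond_init(&rv->cv, NULL) != 0) {
                (void) pthread_mutex_destroy(&rv->lock);
                return (-1);
        }
        rv->done = 0;
        rv->dfd = -1;
        return (0);
}

/* Called by the door server thread: publish the result and wake the waiter. */
void
rv_post(struct rendez_vous *rv, int dfd)
{
        (void) pthread_mutex_lock(&rv->lock);
        rv->dfd = dfd;
        rv->done = 1;
        (void) pthread_cond_broadcast(&rv->cv);
        (void) pthread_mutex_unlock(&rv->lock);
}

/* Called by the forking thread: block until rv_post() has happened. */
int
rv_wait(struct rendez_vous *rv)
{
        int dfd;

        (void) pthread_mutex_lock(&rv->lock);
        while (!rv->done)
                (void) pthread_cond_wait(&rv->cv, &rv->lock);
        dfd = rv->dfd;
        (void) pthread_mutex_unlock(&rv->lock);
        return (dfd);
}

void
rv_destroy(struct rendez_vous *rv)
{
        (void) pthread_cond_destroy(&rv->cv);
        (void) pthread_mutex_destroy(&rv->lock);
}
```

Because the waiter checks the done flag rather than relying on the wakeup itself, a post that arrives before the wait is not lost, and a failed child (which posts -1) still releases the parent.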

The rest is left as an exercise for the reader. Though if this really is useful I can arrange to post the whole thing (probably under the CDDL).

~ by nico on December 15, 2006.
