» More on the design and implementation of Solaris’ ID mapping facility, part 1: kernel-land Cryptonector

More on the design and implementation of Solaris’ ID mapping facility, part 1: kernel-land

UPDATE: The ZFS FUID code was written by Mark Shellenbaum. Also, something someone said recently confused me as to who came up with the idea of ephemeral IDs; it was Mike Shapiro.

Now that you know all about ephemeral IDs and ID mapping, let’s look at Solaris ID mapping more closely. Afshin has a great analogy to describe what was done to make Solaris deal with SMB and CIFS natively, you should not miss it.

Let’s begin with how the kernel treats Windows SIDs and ID mapping.

[Note: the OpenSolaris source code web browser interface cannot find the definitions of certain C types and functions, so in some places I’ll link to files and line numbers. Such links will grow stale over time. If and when the OpenSolaris source browser interface is fixed I may come back to fix these links.]

SIDs in the kernel

First we have
$SRC/uts/common/os/sid.c.
Here you can see that the kernel does not use the traditional SID structure or wire encoding. Instead Solaris treats SIDs as ksid_t objects consisting of an
interned domain SID (represented by ksiddomain_t) and a uint32_t RID. The prefix is just the stringified form of the SID (S-1-<authority>-<RID0>-<RID1>-…<RIDn>) up to, but excluding the last RID.

Treating SIDs as a string prefix and an integer RID is a common thread running through all the Solaris components that deal with SIDs, except, of course, where SIDs must be encoded for use in network protocols. Interning is used where space or layout considerations make small, fixed-sized objects preferable to variable-length SID structures, namely: in the kernel and on-disk in ZFS.

The ksidlookupdomain() function takes care of interning SID prefixes for use in-kernel. The interned SID prefix table is just an AVL tree, naturally.

The SIDs of a user are represented by credsid_t,
which contains three SIDs plus a list of SIDs that is akin to the supplementary group list. credsid_t objects are reference counted and referenced from cred_t. This is done because the Solaris kernel copies cred_t objects quite often, but a cred_t’s SID list is simply not expected to change very often, or even ever; avoiding unnecessary copies of potentially huge SID lists (users with hundreds of group memberships are common in Windows environments) is highly desirable. The crdup() function and friends take care of this.

Back to sid.c for a moment, lookupbyuid() and friends are where the kernel calls the idmap module to map SID<->UIDs. But we’ll look at the kernel idmap module later.

Note that not all ephemeral IDs are valid. Specifically, only ephemeral IDs in ranges allocated to the running idmapd daemon are considered valid. See the VALID_UID() and VALID_GID() macros. Kernel code needs to be careful to allow only non-ephemeral UIDs/GIDs in any context where they might be persisted across reboots (e.g., UFS!), or to map them back to SIDs (e.g., ZFS!); in all other cases kernel code should be checking that any UIDs/GIDs are valid using those macros. The reason that the VALID_UID/GID() checks are macros should be instantly clear to the reader: we’re optimizing for the expected common/fast case where the given ID is non-ephemeral, in which case we can save a function call. Wherever neither SIDs nor ephemeral IDs can be used the kernel must substitute suitable non-ephemeral IDs, namely, the ‘nobody’ IDs — see crgetmapped(), for example.

Can you spot the zones trouble with all of this? All this code was built for global-zone only purposes due to time pressures, though we knew that eventually we’d need to properly virtualize ephemeral IDs and ID mapping. Now that we have a zoned consumer (the NFSv4 client, via nfsmapid(1M)), however, we must virtualize ID mapping so that each zone can continue to have its own UID/GID namespace as usual. The fix is in progress; more details below.

BTW, the sid.c, cred.c code and related headers was designed and written by Casper Dik.

SIDs in ZFS

Next we look at how ZFS handles SIDs.

Take a look at $SRC/uts/common/fs/zfs/zfs_fuid.c. This is where FUIDs are implemented. A FUID is ZFS’s way of dealing with the fact that SIDs are variable length. Where ZFS used to store 32-bit UIDs and GIDs it now stores 64-bit “FUIDs,” and those are simply a {<interned SID prefix>, <RID>} tuple. Traditional POSIX UIDs and GIDs in the -..2^31-1 range are stored with zero as the interned SID prefix. The interned SID prefix table, in turn, is stored in each dataset.

Here too we see calls to the idmap kernel module, but again, more about that below. And you can see that ZFS keeps a copy of the FUID table in-kernel as an AVL tree (boy, AVL trees are popular for caches!).

If I understand correctly, the ZFS FUID code was written by Mark Shellenbaum. The idea for FUIDs came from Afshin Salek. I’m not sure who thought of using the erstwhile negative UID/GID namespace for dynamic, ephemeral ID mapping.

And you can also see that we have some zone issues here also; these will be addressed, as with all the zone issues mentioned here, in a bug fix that is currently in progress.

I’ll leave VFS/fop and ZFS ACL details for another entry, or perhaps for another blogger. The enterprising reader can find the relevant ARC cases and OpenSolaris source code.

The idmap kernel module

Finally we look at the idmap kernel module. This module has several major components: a lookup cache, the basic idmap API, with door upcalls to idmapd, and idmapd registration/unregistration.

The idmap kernel module is fairly straightforward. It uses ONC RPC over doors to talk to idmapd.

Unfortunately there is no RPC-over-doors support in the kernel RPC module. Fortunately implementing RPC-over-doors was quite simple, as you can see in kidmap_rpc_call(). The bulk of the XDR code is generated by rpcgen(1) from the idmap protocol .x file. The code in $SRC/uts/common/idmap/idmap_kapi.c is mostly about implementing the basic ID mapping API.

The module’s cache is, again, implemented using an AVL tree. Currently the only way to clear the cache is to unload the module, but as we add zone support this will no longer work, and we’ll switch, instead, to unloading the cache whenever idmapd exits cleanly (i.e., unregisters), which will make it possible to clear the cache by stopping or restarting the svc:/system/idmap service. Also, we’ll be splitting the cache into several to better support diagonal mapping.

Finally, I’ll briefly describe the API.

The ID mapping APIs are designed to batch up many mapping requests into a single door RPC call, and idmapd is designed to batch up as much database and network work as possible too. This is to reduce latency in dealing with users with very large Windows access tokens, or ACLs with many distinct ACE subjects — one door call for mapping 500 SIDs to POSIX IDs is better than 500 door calls for mapping one SID to a POSIX ID. The caller first calls kidmap_get_create() to get a handle for a single batched request, then the caller repeatedly calls any of the kidmap_batch_get*by*() functions to add a request to the batch, followed by a call to kidmap_get_mappings() to make the upcall with all the batched requests, or the caller can abort a request by calling kidmap_get_destroy(). APIs for non-batched, one-off requests are also provided. The user-land version of this API can also deal with user/group names.

The idmap kernel module was written mostly by Julian Pullen (no blog yet).

A word about zones

As I mentioned above, we need to virtualize the VALID_*ID() macros, underlying functions, and some of the ksid_*() and zfs_fuid_*() functions. We’re likely going to add a zone_t * argument to the non-batch kidmap API functions and to kidmap_get_create(), as well as to the VALID_*ID() macros, related functions, and affected ksid_*() functions. The affected zfs_fuid_*() and cr*() functions already have a cred_t * argument (or their callers do), from which we can get a zone_t * via crgetzone(). The biggest problem is that it appears that there exists kernel code that calls VOPs from interrupt context(!), with a NULL cr, so that we’ll need a way to indicate that the current zone is not known (or, if in SMB server context, that this is the global zone); the idmap kernel module will have to know to map IDs to the various nobody IDs (including Nobody SID) when no zone is identified by the caller.

In the next blog entry I’ll talk about the user-land aspects of Solaris ID mapping.

~ by nico on November 13, 2007.

Posted in General

One Response to “More on the design and implementation of Solaris’ ID mapping facility, part 1: kernel-land”

Great stuff Nico, thanks for sharing the details of the internals ! Very valuable data for anyone that is going to play with this part of the code.

I see that your next topic will be on the userland aspects, I am looking forward to it. Possibly part of it will be about the thread model of the daemon … if that has any relevance ?

Serge said this on November 16, 2007 at 06:20 | Reply

Cryptonector