DLL hell

•June 9, 2009 • 5 Comments

A definitive treatise on coping with DLL hell (in general, not just in the Windows world whence the name came) would be nice.

DLL hell nowadays, and in the Unix world, is what you get when a single process loads and runs (or tries to) two or more versions of the same shared object at the same time, or when multiple versions of the same shared object exist on the system and the wrong one (from the point of view of a caller in that process) gets loaded. This can happen for several reasons, and when it does the results tend to be spectacular.

Typically DLL hell can result when:

  • multiple versions of the same shared object are shipped by the same product/OS vendor as an accident of development in a very large organization or of political issues;
  • multiple versions of the same shared object are shipped by the same product/OS vendor as a result of incompatible changes made in various versions of that shared object without corresponding updates to all consumers of that shared object shipped by the vendor (this is really just a variant of the previous case);
  • a third party ships a plug-in that uses a version of the shared object also shipped by the third party, and which conflicts with a copy shipped by the vendor of the product into which the plug-in plugs in, or where such a conflict arises later when the vendor begins to ship that shared object (this is not uncommon in the world of open source, where some project becomes very popular and eventually every OS must include it).

At first glance the obvious answer is to get all developers, at the vendor and third parties, to ship updates that remove the conflict by ensuring that a single version, shipped by the vendor, will be used. But in practice this can be really difficult to do because: a) there are too many parties to coordinate with, none of whom budgeted for DLL hell surprises and none of whom appreciate the surprise or want to do anything about it when another party could do something instead, b) agreeing on a single version of said object may involve doing lots of development to ensure that all consumers can use the chosen version, c) there’s always the risk that future consumers of this shared object will want a new, backwards-incompatible version of that object, which means that DLL hell is never ending.

Ideally libraries should be designed so that DLL hell is reasonably survivable. But this too is not necessarily easy, and requires much help from the language run-time or run-time linker/loader. I wonder how far such an approach could take us.

Consider a library like SQLite3. As long as each consumer’s symbol references to SQLite3 APIs are bound to the correct version of SQLite3, then there should be no problem, right? I think that’s almost correct, just not quite. Specifically, SQLite3 relies on POSIX advisory file locking, and if you read the comments on that in the src/os_unix.c file in SQLite3 sources, you’ll quickly realize that yes, you can have multiple versions of SQLite3 in one process, provided that they are not accessing the same database files!
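To see why, recall the fcntl(2) semantics that the os_unix.c comments warn about: POSIX advisory locks belong to the process, and closing any descriptor for a file silently drops every lock the process holds on that file. Here is a minimal, hypothetical sketch of the failure mode, with two open()s standing in for two embedded copies of the library (the file path is made up):

    /*
     * Sketch: why two independent SQLite3 copies in one process are
     * dangerous if they touch the same database file. Closing *any*
     * file descriptor for a file releases all of this process's POSIX
     * advisory locks on it.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* fork a child that reports whether it can write-lock the file */
    static int
    child_can_lock(const char *path)
    {
        pid_t pid = fork();
        if (pid == 0) {
            int fd = open(path, O_RDWR);
            struct flock fl = { 0 };
            fl.l_type = F_WRLCK;
            fl.l_whence = SEEK_SET;
            _exit(fcntl(fd, F_SETLK, &fl) == 0 ? 1 : 0);
        }
        int status;
        (void) waitpid(pid, &status, 0);
        return (WEXITSTATUS(status));
    }

    int
    main(void)
    {
        const char *db = "/tmp/demo.db";             /* hypothetical path */
        int fd_a = open(db, O_RDWR | O_CREAT, 0600); /* "library copy A" */
        int fd_b = open(db, O_RDWR);                 /* "library copy B" */
        struct flock fl = { 0 };

        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        (void) fcntl(fd_a, F_SETLK, &fl);            /* copy A locks the file */
        (void) printf("while A holds the lock: %d\n", child_can_lock(db));  /* 0 */

        (void) close(fd_b);                          /* copy B merely closes its own fd... */
        (void) printf("after B's close():      %d\n", child_can_lock(db));  /* 1: A's lock is silently gone */
        return (0);
    }

The same trap awaits any two library copies that independently manage the same lock files, which is exactly the shared-state problem discussed next.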

In other words, multiple versions of some library can co-exist in one process, provided that there’s no implied, unexpected shared state between them that could cause corruption.

What sorts of such implied, unexpected shared state might there be? Objects named after the process’s PID come to mind, for example (pidfiles, …). And POSIX advisory file locking (see above). What else? Imagine a utility function that searches the process’s open file descriptors for ones that the library owns — oops, but at least that’s not very likely. Any process-local namespace that is accessible by all objects in that process will provide a source of conflicts. Fortunately thread-specific keys are safe.

DLL hell is painful, and it can’t be prevented altogether. Perhaps we could produce a set of library design guidelines that developers could follow to produce DLL hell-safe libraries. The first step would be to make sure that the run-time can deal. Fortunately the Solaris linker provides “direct binding” (-B direct) and “groups” (-B group and RTLD_GROUP), so that between the two (and run-path and the like) it should be possible to ensure that each consumer of some library always gets the right one (provided one does not use LD_PRELOAD). Perhaps between linker features, careful coding and careful use, DLL hell can be made survivable in most cases. Thoughts? Comments?
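For the plug-in case specifically, the group mechanism is also reachable from code: dlopen(3C) accepts RTLD_GROUP, so a plug-in host can load each plug-in as a self-contained group that resolves symbols from the objects the plug-in brought along rather than from same-named objects already in the process. A rough sketch (the plug-in path and entry point are made up):

    #include <dlfcn.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* hypothetical plug-in that bundles its own copy of a library */
        void *h = dlopen("/opt/example/plugin.so",
            RTLD_LAZY | RTLD_LOCAL | RTLD_GROUP);   /* RTLD_GROUP is Solaris-specific */
        if (h == NULL) {
            (void) fprintf(stderr, "dlopen: %s\n", dlerror());
            return (1);
        }

        /* hypothetical plug-in entry point */
        void (*init)(void) = (void (*)(void))dlsym(h, "plugin_init");
        if (init != NULL)
            init();
        return (0);
    }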

Automated Porting Difficulties: Run-time failures in roboported FOSS

•December 12, 2008 • Leave a Comment

As I explained in my previous blog entry, I’m working on a project whose goal is to automate the process of finding, building and integrating FOSS into OpenSolaris so as to populate our /pending and /contrib (and eventually /dev) IPS package repositories with as much useful FOSS as possible.

We’ve not done a good job of tracking build failures due to missing interfaces in OpenSolaris, though in the next round of porting we intend to track and investigate them. But when we tested candidate packages for /contrib we did run into run-time failures that were due to differences between Linux and Solaris. These were mostly due to:

  1. FOSS expected a Linux-style /proc
  2. CLI conflicts

The first of those was shocking at first, but I quickly remembered: the Linux /proc interfaces are text-based, thus no headers are needed in order to build programs that use /proc. Applications targeting the Solaris /proc could not possibly build on Linux (aside from cross-compilation targeting Solaris, of course): the necessary header, <procfs.h>, would not exist, therefore compilation would break.
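The difference is easy to see in code. This sketch reads information about the current process both ways: the Solaris branch cannot even be compiled on Linux because <procfs.h> does not exist there, while the Linux branch compiles anywhere but finds no text-format /proc/self/stat at run time on Solaris.

    #include <stdio.h>

    #ifdef __sun
    #include <fcntl.h>
    #include <procfs.h>     /* Solaris-only header: binary /proc structures */
    #include <unistd.h>

    static void
    show_self(void)
    {
        psinfo_t p;
        int fd = open("/proc/self/psinfo", O_RDONLY);

        if (fd >= 0 && read(fd, &p, sizeof (p)) == sizeof (p))
            (void) printf("%s (pid %d)\n", p.pr_fname, (int)p.pr_pid);
        if (fd >= 0)
            (void) close(fd);
    }
    #else
    static void
    show_self(void)
    {
        char line[512];
        FILE *f = fopen("/proc/self/stat", "r");    /* Linux: one line of text */

        if (f != NULL && fgets(line, sizeof (line), f) != NULL)
            (void) printf("%s", line);              /* "pid (comm) state ..." */
        if (f != NULL)
            (void) fclose(f);
    }
    #endif

    int
    main(void)
    {
        show_self();
        return (0);
    }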

Dealing with Linux /proc applications is going to be interesting. Even detecting them is going to be interesting, since they could be simple shell/Python/whatever scripts: simply grepping for “/proc” && !”procfs.h” will surely result in many false positives requiring manual investigation.

The second common run-time failure mode is also difficult to detect a priori, but I think we can at least deal with it automatically. The incompatible-CLI problem results in errors like:

Usage: grep -hblcnsviw pattern file . . .

when running FOSS that expected GNU grep, for example. Other common cases include ls(1), ifconfig(1M), etcetera.

Fortunately OpenSolaris already has a way to get Linux-compatible command-line environments: just put /usr/gnu/bin before /usr/bin in your PATH. Unfortunately that’s not an option here, because some programs will expect a Solaris CLI and others will expect a Linux CLI.

But fortunately, once again, I think there’s an obvious way to select which CLI environment to use (Solaris vs. Linux) on a per-executable basis (at least for ELF executables): link in an interposer on the exec(2) family of functions, and have the interposer ensure that the correct preference of /usr/gnu/bin or /bin is chosen. Of course, this will be a simple solution only for programs that compile into ELF, and not remotely as simple, perhaps not even feasible, for scripts of any kind.
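Something along these lines is what I have in mind, as an untested sketch: interpose on execve(2), built as a shared object and activated via the run-time linker for the executables that need it, and rewrite PATH on the way through. A real interposer would of course have to cover the rest of the exec family (and execvp-style PATH searches) as well.

    #include <dlfcn.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define GNUBIN  "/usr/gnu/bin:"

    int
    execve(const char *path, char *const argv[], char *const envp[])
    {
        static int (*real_execve)(const char *, char *const [], char *const []);
        char **newenv;
        int n, i;

        if (real_execve == NULL)
            real_execve = (int (*)(const char *, char *const [], char *const []))
                dlsym(RTLD_NEXT, "execve");

        for (n = 0; envp[n] != NULL; n++)
            ;
        if ((newenv = malloc((n + 1) * sizeof (char *))) == NULL)
            return (real_execve(path, argv, envp));

        for (i = 0; i < n; i++) {
            newenv[i] = envp[i];
            if (strncmp(envp[i], "PATH=", 5) == 0) {
                /* prepend /usr/gnu/bin to the existing PATH value */
                char *p = malloc(strlen(envp[i]) + sizeof (GNUBIN));
                if (p != NULL) {
                    (void) sprintf(p, "PATH=%s%s", GNUBIN, envp[i] + 5);
                    newenv[i] = p;
                }
            }
        }
        newenv[n] = NULL;

        /* the copies leak if the exec fails; fine for a sketch */
        return (real_execve(path, argv, newenv));
    }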

I haven’t yet tried the interposer approach for the CLI preference problem, but I will, and I’m reasonably certain that it will work. I’m not as optimistic about the /proc problem; right now I’ve no good ideas about how to handle it, short of manually porting the applications in question or choosing not to package them for OpenSolaris at all until the upstream communities add support for the Solaris /proc. I.e., the /proc problem is very interesting.

Massively porting FOSS for OpenSolaris 2008.11 /pending and /contrib repositories

•December 10, 2008 • 1 Comment

Today is the official release of OpenSolaris 2008.11, including commercial support.

Along with OpenSolaris 2008.11 we’re also publishing new repositories full of various open source software built and packaged for OpenSolaris:

  • A pending repository with 1,708 FOSS pkgs today, and many more coming. This is “pending” in that we want to promote the packages in it to the contrib repository.
  • A contrib repository with 154 FOSS pkgs today, and many more coming soon.

These packages came from two related OpenSolaris projects in the OpenSolaris software porters community.

The two projects focus on different goals. Here I describe the work that we did on the PkgFactory/Roboporter project. Our primary goal is to port and package FOSS for OpenSolaris as quickly as possible. We do not yet focus very much on proper integration with OpenSolaris, such as making sure that the FOSS we package is properly integrated with RBAC, SMF and the Solaris audit facilities, with manpages placed in the correct sections, etcetera. We do, however, intend to get close enough to proper integration that the most valuable packages can then be polished off manually, put through the ARC and c-team processes, and pushed to the /dev repository.

Note, by the way, that the /pending and /contrib repositories are open to all contributors. The processes for contributing packages to these repositories are described in the SW Porters community pages, so if you’d like to make sure that your favorite FOSS is included, you can always do it yourself!

The 154 packages in /contrib are a representative subset of the 1,708 packages in /pending, which in turn are a representative subset of some 10,000 FOSS pkgs that we had in a project-private repository. That’s right, 10,000, which we built in a matter of just a few weeks. [NOTE: Most, but not all, of the 1,708 packages in /pending and 154 in /contrib came from the pkgfactory project.]

The project began with Doug Leavitt doing incredible automation of: a) searching for and downloading spec files from SFE and similar from Ubuntu and other Linux packaging repositories, b) building them on Solaris. (b) is particularly interesting, but I’ll let Doug blog about that. With Doug’s efforts we had over 12,000 packages in a project-private IPS repository, and the next step was to clean things up, cut the list down to something that we could reasonably test and push to /pending and /contrib. That’s where Baban Kenkre and I jumped in.

To come up with that 1,708-package list we first removed all the Perl5 CPAN stuff from the list of 12,000, then we wrote a utility to look for conflicts between our repository, the Solaris WOS and OpenSolaris. It turned out we had many conflicts even within our own repository (some 2,000 pkgs were removed as a result, if I remember correctly, after removing the Perl5 packages). Then we got down and dirty and did as much [very light-weight] testing as we could.

What’s really interesting here is that the tool we wrote to look for conflicts turned out to be really useful in general. That’s because it loads package information from our project’s repo, the SVR4 Solaris WOS and OpenSolaris into a SQLite3 database, and analyzes the data to some degree. What’s really useful about this is that with little knowledge of SQL we could do many ad-hoc queries that helped a lot when it came to whittling down our package list and testing. For example: getting a list of all executables in /bin and /usr/sbin that are delivered by our package factory and which have manpages was trivial, and quite useful (because then I could read the manpages in one terminal and try the executables in another, which made the process of light-weight testing much faster than it would have otherwise been). We did lots of ad-hoc queries against this little database, the kinds of queries that would have required significantly more scripting without a database; SQL is a very powerful language!
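For the curious, the queries themselves were nothing fancy. Here is a made-up example in the same spirit, against a hypothetical files(pkg, path) table (one row per file a package delivers, in a hypothetical packages.db), approximating “executables in /bin or /usr/sbin whose package also delivers manpages”:

    #include <sqlite3.h>
    #include <stdio.h>

    int
    main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;
        /* hypothetical schema: files(pkg TEXT, path TEXT) */
        const char *sql =
            "SELECT DISTINCT f.path, f.pkg FROM files f "
            "WHERE (f.path LIKE '/bin/%' OR f.path LIKE '/usr/sbin/%') "
            "  AND EXISTS (SELECT 1 FROM files m WHERE m.pkg = f.pkg "
            "              AND m.path LIKE '/usr/share/man/%') "
            "ORDER BY f.path";

        if (sqlite3_open("packages.db", &db) != SQLITE_OK)
            return (1);
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
            return (1);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            (void) printf("%s\t%s\n",
                (const char *)sqlite3_column_text(stmt, 0),
                (const char *)sqlite3_column_text(stmt, 1));
        (void) sqlite3_finalize(stmt);
        (void) sqlite3_close(db);
        return (0);
    }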

That’s it for now. We’ll blog more later. In the meantime, check out the /pending and /contrib repositories. We hope you’re pleased. And keep in mind that what you see there is mostly the result of just a few weeks of the PkgFactory project work, so you can expect: a) higher quality as we improve our integration techniques and tools, and b) more, many, many more packages as we move forward. Our two projects’ ultimate goal is to package for OpenSolaris all of the useful, redistributable FOSS that you can find on Sourceforge and other places.

Technology Underlying the Sun Storage 7000 Series

•November 10, 2008 • Leave a Comment

I’m late to the party. And I don’t have much to blog about my team’s part in the story of the 7000 Series that I haven’t already blogged, most of it about ID mapping, and some about filesystem internationalization. Except, of course, to tell you that today’s product launch is very exciting for me. Not only is this good for Sun’s customers (current and, especially, future!) and for Sun, but it’s also incredibly gratifying to see something that one has worked hard on become part of a major product and be depended on by others.

Above all: Congratulations to the Fishworks team and to the many teams that contributed to making this happen. The list of such teams is long. Between systems engineering, Solaris engineering and the business teams that made all this possible, plus the integration provided by the Fishworks team, this is a truly enormous undertaking. Just look at the implausible list of storage protocols spoken by the storage appliance: CIFS, NFS, iSCSI, FTP, WebDAV, NDMP and VSCAN, all backed by ZFS. I’m barely scratching the surface here. It’s not just the storage protocols; for example, DTrace has an enormous role to play here as well, and there are many other examples.

The best part is the integration, the spectacular BUI (browser user interface). No, wait, the best part is the underlying technologies. No, wait! The best part is the futures. It’s hard to decide what the best aspect of the Sun Storage 7000 series is: the story, the people, the technologies, the future, or even what it says about Sun: that Sun Microsystems can innovate and reinvent itself even when the financials don’t look great, even while doing much of the development in the open!

The new storage appliance was a project of major proportions, much of it undertaken in the open. I wonder how many thought that this was typical of Sun, to develop cool technologies without knowing how to tie them together. I hope we’ve shocked you. Now you know: Sun can complete acquisitions successfully and obtain product synergies (usually a four-letter word, that), Sun can do modular development and bring it all together, Sun can detect new trends in the industry (e.g., read-biased SSDs, write-biased SSDs, …) and capitalize on them, Sun can think outside the box and pull rabbits out of its hat. And you better bet: we can and will keep that up.

Observing ID mapping with DTrace

•September 19, 2008 • Leave a Comment

Want to see how idmapd maps some Windows SID to a Unix UID/GID? The idmap(1M) command does provide some degree of observability via the -v option to the show sub-command, but not nearly enough. Try this DTrace script.

The script is not complete, and, most importantly, is not remotely stable, as it uses pid provider probes on internal functions and encodes knowledge of private structures, all of which can change without notice. But it does help a lot! Not only does it help understand operational aspects of ID mapping, but also idmapd’s internals. And, happily, it points the way towards a proper, stable USDT provider for idmapd.

Folks who’ve seen the RPE TOI for ID mapping will probably wish that I’d written this months ago, and used it in the TOI presentation :)

Running the stress tests on idmapd with this script running produces an enormous amount of output, clearly showing how the asynchronous Active Directory LDAP searches and search results are handled.

The compromise on abortion that the Republican mavericks should offer

•September 4, 2008 • 1 Comment

I don’t like blogging about politics. My previous blog entry was the only one I’ve written on politics, and that was about international geopolitics.

But Sarah Palin inspires me.

The culture wars in the U.S. have two major flash points: abortion and gay civil unions/gay marriage.

Abortion is the intractable one. But I believe there is a way, a novel way.

Begin by accepting that, given the structure of the American republic and the Supreme Court’s precedents on abortion, there is no chance that abortion can be made illegal any time soon. Even if Roe vs. Wade, and Casey, and the Court’s other abortion precedents were overturned, the issue would merely become a local issue (though it would also stay a national issue), and most States would likely keep the existing regime more or less as is. It will take decades for the pro-life camp to get its way, if it ever does.

That leaves former President Bill Clinton’s formulation of “safe, legal and rare” as the only real option for the pro-life camp. The pro-choice camp’s goal, on the other hand, is pretty safe.

Of course, Bill Clinton never did much, if anything, to make abortion rare. And whatever one might do needs to be sold to the pro-life camp with more than “it’s all you can hope for.”

The solution, then, is to think about the problem from an economics (and demographic) point of view.

Consider: making abortion illegal will not mean zero abortions, for back-alley abortions will return, with the consequent injuries and loss of baby and maternal life. So we can only really hope to minimize abortion. Looked at this way, Bill Clinton’s formula looks really good. This is the argument with which to sell an economics-based solution.

And the solution? Simple: provide financial incentives…

  • …to families to adopt children (though there is already a shortage of children to adopt),
  • to women with unwanted pregnancies to proceed with the pregnancy and put their babies up for adoption,
  • and to abortion clinics to participate in the process of matching women with unwanted pregnancies to families who wish to adopt (effectively becoming market makers — it sounds awful, to market children, but isn’t the alternative worse?).

It sounds like a government program that no fiscal conservative should want taxpayers to pay for. But consider that in the long-term it pays for itself by increasing the future tax base (more babies now -> more adults in the labor pool later). And consider the opportunity cost of not having these children! For Japan- and Russia-style population implosion would have disastrous consequences for the American economy (consider Social Security…). Avoiding population implosion alone should be reason enough to go for such a program. How much to offer as incentives? I don’t know, but even if such a program came to cost $50,000 per-baby that would still be cheap, considering the demographics angle.

So, allow choice, but seek to influence it, with naked bribes, yes, but not coercion (which wouldn’t be “choice”).

This brings us to gay civil unions and/or gay marriage. It’s certainly past the time when any politician of consequence could seriously propose the criminalization of homosexuality in the U.S.; sexual autonomy, at least in the serial monogamy sense, has been a de facto reality for a long time, and now it is de jure. Now, if gay civil unions or marriage could mean more adoptive parents of otherwise-to-be-aborted children, then what can someone who is pro-life do but support at least gay civil unions? If life is the imperative, then surely we can encourage gay couples to help, and let God judge whether homosexuals are in sin or not.

Alright, now that that’s out of the way I hope to go back to my non-politics blogging ways.

Conclusions from the Georgia war

•August 12, 2008 • 21 Comments

Georgia was simply not a defensible route for Europe to energy independence from Russia. Nor could it have been for years to come: because of its remoteness, and unless Turkey wished to take a very active role in NATO (which seems unlikely), it was bound to stay indefensible for as long as Russia manages to keep up its military (i.e., for the foreseeable future).

Therefore Europe has two choices: become a satellite of Russia, or pursue alternatives to natural gas and oil from Russia.

Saving Europe from subservience to Russia will require the development of new energy sources. Geopolitical plays can only work if backed by willingness to use superior military firepower. Europe clearly lacks the necessary military superiority and will-power; therefore only new nuclear power plants and new non-Russian/non-OPEC oil and gas sources qualify in the short- to medium-term.

So, ramp up nuclear power production (as that’s the only alternative fuel with a realistic chance of producing enough additional power in the short- to medium-term). And, of course, building more terminals to receive oil and LNG tankers would help.

But any oil/gas to be received by tanker terminals has got to come from somewhere (and Russia’s has got to have an outlet other than Europe). It would help enormously if new oil sources outside OPEC and Russia could be developed, as new friendly supplies would reduce the leverage that Russia has on Europe. That can only be Brazilian, American and Canadian oil.

Does Europe have the fortitude to try? Does the U.S. have the leverage to get Europe to try?

The big loser here is Europe. Europe now has to choose whether to surrender or struggle for independence. The U.S. probably can’t force them. A European surrender to Russia will be slow, and subtle, but real. If Europe surrenders then NATO is over. Funny, that Russia is poised to achieve what the Soviet Union could not. But it isn’t funny. And I suspect few citizens of Europe understand, and few that do object; anti-Americanism may have won.

The only thing Europe has going for it is that there is much less NIMBYist resistance to nuclear power there than in the U.S. Also, awareness that a power crunch is at hand, and that a much more severe one is probably coming, is starting to sink in around the world (drilling for oil everywhere is now very popular in the U.S., for example, with very large majorities in favor; support for new nuclear power plants is bound to follow as well).

As for the environment, I don’t for a second believe in anthropogenic global warming, but ocean acidification is much easier to prove, appears to be real, and is much, much more of an immediate and dire threat to humans than global warming. Regardless of which threat is real, and regardless of how dire, there’s only one way to fight global warming/ocean acidification: increase the wealth of Earth’s nations, which in the short-term means producing more energy. American rivers were an environmental mess four decades ago, but today the U.S. is one of the cleanest places on Earth. The U.S. cleaned up when its citizens were rich enough that they could manage to care and to set aside wealth for cleaning things up. It follows that the same is true for the rest of the world, and if that’s not enough, consider what would happen if the reverse approach were followed instead: miserable human populations that will burn whatever they must to survive, the environment be damned.

Let us set on a crash course to develop new energy sources, realistic and practical ones, and let us set on a course to promote and develop international commerce like never before.

Can we map IDs between Unix domains? (e.g., for NFSv4)

•June 13, 2008 • 1 Comment

Today (onnv build 92), no.

But there’s no reason we couldn’t add support for it.

Here’s how I would do it:

  • First, map all UIDs and GIDs in foreign Unix domains to S-1-22-3-<domain-RIDs>-<UID> and S-1-22-4-<domain-RIDs>-<GID>, respectively (see the sketch after this list). Whence the domain RIDs? Preferably we’d provide a way for each domain to advertise a domain SID. Otherwise we could allow each domain’s SID to be configured locally. Or else derive it from the domain’s name, e.g., octet_string_to_RIDs(SHA_256(domain_name)).
  • Second, map all user and group names in foreign Unix domains to <name>@<domain-name>.
  • Third, use libldap to talk to foreign Unix domains with RFC2307+ schemas. Possibly also add support for using NIS. (Yes, the NIS client allows binding to multiple domains, though, of course, the NIS name service backend uses only one; the yp_match(3NSL) and related functions take an optional NIS domain name argument.)
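To make the first bullet concrete, here is a rough user-land sketch of the octet_string_to_RIDs(SHA_256(domain_name)) idea: hash the domain name, slice the digest into 32-bit RIDs, and format the proposed S-1-22-3 SID for a foreign UID. The choice of four sub-authority RIDs and the example domain name are arbitrary, and none of this is an existing Solaris interface.

    #include <openssl/sha.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *domain = "unix.example.com";    /* hypothetical foreign domain */
        uint32_t uid = 1001;
        unsigned char digest[SHA256_DIGEST_LENGTH];
        uint32_t rids[4];
        char sid[128];
        int i, off;

        /* hash the domain name ... */
        SHA256((const unsigned char *)domain, strlen(domain), digest);

        /* ... and turn the first 16 octets into four 32-bit RIDs */
        for (i = 0; i < 4; i++)
            rids[i] = ((uint32_t)digest[4 * i] << 24) |
                ((uint32_t)digest[4 * i + 1] << 16) |
                ((uint32_t)digest[4 * i + 2] << 8) |
                (uint32_t)digest[4 * i + 3];

        off = snprintf(sid, sizeof (sid), "S-1-22-3");
        for (i = 0; i < 4; i++)
            off += snprintf(sid + off, sizeof (sid) - off, "-%u", rids[i]);
        (void) snprintf(sid + off, sizeof (sid) - off, "-%u", uid);

        (void) printf("%s\n", sid);
        return (0);
    }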

This would require changes to idmapd(1M). I think the code to talk to foreign Unix domains and cast their IDs into our local form should be easy to compartmentalize. idmapd would have to learn how to determine the type of any given domain, and how to find out how to talk to it — this is where most of the surgery on idmapd would happen.

I don’t know when we might get to this. Maybe an enterprising member of the community could look into implementing this if they are in a hurry.

(destructuring-bind) for XML

•January 11, 2008 • Leave a Comment

XPATH

Plus ça change…

More on the design and implementation of Solaris’ ID mapping facility, part 1: kernel-land

•November 13, 2007 • 1 Comment

UPDATE: The ZFS FUID code was written by Mark Shellenbaum. Also, something someone said recently confused me as to who came up with the idea of ephemeral IDs; it was Mike Shapiro.

Now that you know all about ephemeral IDs and ID mapping, let’s look at Solaris ID mapping more closely. Afshin has a great analogy to describe what was done to make Solaris deal with SMB and CIFS natively; you should not miss it.

Let’s begin with how the kernel treats Windows SIDs and ID mapping.

[Note: the OpenSolaris source code web browser interface cannot find the definitions of certain C types and functions, so in some places I’ll link to files and line numbers. Such links will grow stale over time. If and when the OpenSolaris source browser interface is fixed I may come back to fix these links.]


SIDs in the kernel

First we have $SRC/uts/common/os/sid.c. Here you can see that the kernel does not use the traditional SID structure or wire encoding. Instead Solaris treats SIDs as ksid_t objects consisting of an interned domain SID (represented by ksiddomain_t) and a uint32_t RID. The prefix is just the stringified form of the SID (S-1-<authority>-<RID0>-<RID1>-…<RIDn>) up to, but excluding, the last RID.
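To make the representation concrete, here is a deliberately simplified sketch of the idea; these are not the actual sys/sid.h definitions, just the shape of them.

    #include <stdint.h>

    /* one entry in the interned-prefix table (an AVL tree in the kernel) */
    typedef struct sid_prefix_entry {
        char        *pe_str;      /* e.g. "S-1-5-21-1234-5678-9012" */
        uint32_t     pe_refcnt;   /* shared by every SID in that domain */
    } sid_prefix_entry_t;

    /* a SID, kernel-style: fixed size, however long the real SID is */
    typedef struct ksid_sketch {
        sid_prefix_entry_t *ks_prefix;   /* interned, shared prefix */
        uint32_t            ks_rid;      /* the final sub-authority (RID) */
    } ksid_sketch_t;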

Treating SIDs as a string prefix and an integer RID is a common thread running through all the Solaris components that deal with SIDs, except, of course, where SIDs must be encoded for use in network protocols. Interning is used where space or layout considerations make small, fixed-sized objects preferable to variable-length SID structures, namely: in the kernel and on-disk in ZFS.

The ksidlookupdomain() function takes care of interning SID prefixes for use in-kernel. The interned SID prefix table is just an AVL tree, naturally.

The SIDs of a user are represented by credsid_t, which contains three SIDs plus a list of SIDs that is akin to the supplementary group list. credsid_t objects are reference counted and referenced from cred_t. This is done because the Solaris kernel copies cred_t objects quite often, but a cred_t’s SID list is simply not expected to change very often, or even ever; avoiding unnecessary copies of potentially huge SID lists (users with hundreds of group memberships are common in Windows environments) is highly desirable. The crdup() function and friends take care of this.

Back to sid.c for a moment, lookupbyuid() and friends are where the kernel calls the idmap module to map SID<->UIDs. But we’ll look at the kernel idmap module later.

Note that not all ephemeral IDs are valid. Specifically, only ephemeral IDs in ranges allocated to the running idmapd daemon are considered valid. See the VALID_UID() and VALID_GID() macros. Kernel code needs to be careful to allow only non-ephemeral UIDs/GIDs in any context where they might be persisted across reboots (e.g., UFS!), or to map them back to SIDs (e.g., ZFS!); in all other cases kernel code should be checking that any UIDs/GIDs are valid using those macros. The reason that the VALID_UID/GID() checks are macros should be instantly clear to the reader: we’re optimizing for the expected common/fast case where the given ID is non-ephemeral, in which case we can save a function call. Wherever neither SIDs nor ephemeral IDs can be used the kernel must substitute suitable non-ephemeral IDs, namely, the ‘nobody’ IDs — see crgetmapped(), for example.
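Conceptually the check looks something like the sketch below (not the actual kernel macros; the range constant and the slow-path function are stand-ins): the common, non-ephemeral case is decided inline, and only IDs above 2^31-1 pay for the call that asks whether they fall within a range leased to the running idmapd.

    #include <sys/types.h>

    #define MAXUID_SKETCH   2147483647u     /* 2^31 - 1: last non-ephemeral ID */

    /* hypothetical slow path: is this ephemeral ID in a range leased to idmapd? */
    extern int ephemeral_id_is_allocated(uid_t id);

    /* fast path stays inline; no function call for ordinary UIDs */
    #define VALID_UID_SKETCH(uid) \
            ((uid) <= MAXUID_SKETCH || ephemeral_id_is_allocated(uid))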

Can you spot the zones trouble with all of this? All this code was built for global-zone only purposes due to time pressures, though we knew that eventually we’d need to properly virtualize ephemeral IDs and ID mapping. Now that we have a zoned consumer (the NFSv4 client, via nfsmapid(1M)), however, we must virtualize ID mapping so that each zone can continue to have its own UID/GID namespace as usual. The fix is in progress; more details below.

BTW, the sid.c and cred.c code and related headers were designed and written by Casper Dik.


SIDs in ZFS

Next we look at how ZFS handles SIDs.

Take a look at $SRC/uts/common/fs/zfs/zfs_fuid.c. This is where FUIDs are implemented. A FUID is ZFS’s way of dealing with the fact that SIDs are variable length. Where ZFS used to store 32-bit UIDs and GIDs it now stores 64-bit “FUIDs,” and those are simply a {<interned SID prefix>, <RID>} tuple. Traditional POSIX UIDs and GIDs in the 0..2^31-1 range are stored with zero as the interned SID prefix. The interned SID prefix table, in turn, is stored in each dataset.
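The layout is simple enough to sketch; these are conceptual names only, not the actual ZFS macros.

    #include <stdint.h>

    /* upper 32 bits: index into the dataset's interned SID-prefix table;
       lower 32 bits: the RID. Index 0 is reserved for plain POSIX IDs. */
    static inline uint64_t
    fuid_make(uint32_t idx, uint32_t rid)
    {
        return (((uint64_t)idx << 32) | rid);
    }

    static inline uint32_t
    fuid_index(uint64_t fuid)
    {
        return ((uint32_t)(fuid >> 32));
    }

    static inline uint32_t
    fuid_rid(uint64_t fuid)
    {
        return ((uint32_t)fuid);
    }

    /* e.g. fuid_make(0, 1001) == 1001: a traditional UID stored as a FUID */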

Here too we see calls to the idmap kernel module, but again, more about that below. And you can see that ZFS keeps a copy of the FUID table in-kernel as an AVL tree (boy, AVL trees are popular for caches!).

If I understand correctly, the ZFS FUID code was written by Mark Shellenbaum. The idea for FUIDs came from Afshin Salek. I’m not sure who thought of using the erstwhile negative UID/GID namespace for dynamic, ephemeral ID mapping.

And you can also see that we have some zone issues here also; these will be addressed, as with all the zone issues mentioned here, in a bug fix that is currently in progress.

I’ll leave VFS/fop and ZFS ACL details for another entry, or perhaps for another blogger. The enterprising reader can find the relevant ARC cases and OpenSolaris source code.


The idmap kernel module

Finally we look at the idmap kernel module. This module has several major components: a lookup cache; the basic idmap API, with door upcalls to idmapd; and idmapd registration/unregistration.

The idmap kernel module is fairly straightforward. It uses ONC RPC over doors to talk to idmapd.

Unfortunately there is no RPC-over-doors support in the kernel RPC module. Fortunately implementing RPC-over-doors was quite simple, as you can see in kidmap_rpc_call(). The bulk of the XDR code is generated by rpcgen(1) from the idmap protocol .x file. The code in $SRC/uts/common/idmap/idmap_kapi.c is mostly about implementing the basic ID mapping API.

The module’s cache is, again, implemented using an AVL tree. Currently the only way to clear the cache is to unload the module, but as we add zone support this will no longer work, and we’ll switch, instead, to unloading the cache whenever idmapd exits cleanly (i.e., unregisters), which will make it possible to clear the cache by stopping or restarting the svc:/system/idmap service. Also, we’ll be splitting the cache into several to better support diagonal mapping.

Finally, I’ll briefly describe the API.

The ID mapping APIs are designed to batch up many mapping requests into a single door RPC call, and idmapd is designed to batch up as much database and network work as possible too. This is to reduce latency in dealing with users with very large Windows access tokens, or ACLs with many distinct ACE subjects — one door call for mapping 500 SIDs to POSIX IDs is better than 500 door calls for mapping one SID to a POSIX ID. The caller first calls kidmap_get_create() to get a handle for a single batched request, then the caller repeatedly calls any of the kidmap_batch_get*by*() functions to add a request to the batch, followed by a call to kidmap_get_mappings() to make the upcall with all the batched requests, or the caller can abort a request by calling kidmap_get_destroy(). APIs for non-batched, one-off requests are also provided. The user-land version of this API can also deal with user/group names.
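The call sequence, then, looks roughly like this. The function names are the ones mentioned above, but the prototypes here are my own approximations for the sake of the example, not the real idmap_kapi.c declarations.

    #include <sys/types.h>
    #include <stdint.h>

    typedef struct idmap_get_handle idmap_get_handle_t;   /* opaque */
    typedef int idmap_stat;

    /* approximate prototypes, for illustration only */
    extern idmap_get_handle_t *kidmap_get_create(void);
    extern int kidmap_batch_getuidbysid(idmap_get_handle_t *, const char *,
        uint32_t, uid_t *, idmap_stat *);
    extern int kidmap_get_mappings(idmap_get_handle_t *);
    extern void kidmap_get_destroy(idmap_get_handle_t *);

    /* map a whole array of SIDs (prefix + RID pairs) with a single upcall */
    static int
    map_sids_to_uids(const char **prefixes, const uint32_t *rids,
        uid_t *uids, idmap_stat *stats, int n)
    {
        idmap_get_handle_t *h = kidmap_get_create();
        int i;

        if (h == NULL)
            return (-1);
        for (i = 0; i < n; i++)                 /* queue requests; no upcall yet */
            (void) kidmap_batch_getuidbysid(h, prefixes[i], rids[i],
                &uids[i], &stats[i]);
        (void) kidmap_get_mappings(h);          /* one door RPC for the whole batch */
        kidmap_get_destroy(h);                  /* done with the handle */
        return (0);
    }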

The idmap kernel module was written mostly by Julian Pullen (no blog yet).


A word about zones

As I mentioned above, we need to virtualize the VALID_*ID() macros, underlying functions, and some of the ksid_*() and zfs_fuid_*() functions. We’re likely going to add a zone_t * argument to the non-batch kidmap API functions and to kidmap_get_create(), as well as to the VALID_*ID() macros, related functions, and affected ksid_*() functions. The affected zfs_fuid_*() and cr*() functions already have a cred_t * argument (or their callers do), from which we can get a zone_t * via crgetzone(). The biggest problem is that it appears that there exists kernel code that calls VOPs from interrupt context(!), with a NULL cr, so that we’ll need a way to indicate that the current zone is not known (or, if in SMB server context, that this is the global zone); the idmap kernel module will have to know to map IDs to the various nobody IDs (including Nobody SID) when no zone is identified by the caller.

In the next blog entry I’ll talk about the user-land aspects of Solaris ID mapping.