A few weeks ago there was a brief sub-thread on the networking-discuss OpenSolaris list about whether the newly proposed kernel sockets API project should begin by delivering only a synchronous API. Fortunately, that suggestion was quickly dismissed.
IMO all APIs that can block on I/O should be asynchronous.
Even APIs that can “block” only on lengthy computation (e.g., crypto) should be async, because such computation might be offloaded to helper devices (bringing I/O back into the picture), or because the call to such an API might fork a helper thread (think automatic parallelism, which one might want on chip multi-threading (CMT) systems) if that is significantly lighter-weight than just doing the computation.
For example, gss_init_sec_context(3GSS) should have an option to work asynchronously, probably using a callback function to inform the application of readiness.
And open(2), creat(2), readdir(3C), and so on should all be asynchronous. If all filesystem-related system calls had async versions, one could build file selection combo box widgets that are responsive even when looking for files in huge directories (the user would see the list of files grow until the whole directory was read in or the file of interest appeared, rather than having to wait to see anything at all) and that don’t need to resort to threading to achieve the same effect. And the NFS requirement that operations like file creation be synchronous would not penalize clients that support async versions of creat(2) and friends.
Of course, adding async versions of all filesystem-related system calls without resorting to per-call worker threads probably means radical changes to the VFS and the filesystem modules. Which should prove the point: it’s much easier to layer sync interfaces atop async ones than it is to rework the implementation of sync interfaces to support async ones efficiently, so one should always start by implementing async interfaces first.
It’s been decades since GUI programming taught us that everything must be async. And even web applications, which used to be synchronous because of the way HTML and browsers worked, work asynchronously nowadays — async is what Ajax is all about.
Really, we should all refrain from developing and delivering any more synchronous blocking APIs without first delivering asynchronous counterparts.
BTW, closures are a wonderful bit of syntactic sugar to have around when coding to async APIs — they let you define callback function bodies lexically close to where they are referenced.
Given a really cheap way to test for the availability of a thread in a CMT environment, and a really cheap way to start one, then all those callback invocations (closure invocations) could be done on a separate thread when one is available.
I like to think of async programming as light-weight threading because I like to think of closures and continuations as light-weight threading. Continuations built on closures and continuation-passing-style (CPS) conversion in particular (i.e., putting activation records on the heap rather than on a stack) are a form of very light-weight cooperative multi-threading (green threads): thread creation and context switching between threads have the same cost as calling any function. The trade-off when putting activation records on the heap is that much more garbage is created that needs to be collected — automatic stack-based memory allocation and deallocation goes out the window. A compromise might be to use delimited continuations and allocate small stacks, with bounds checking rather than guard pages used to deal with stack growth. The VM manipulation needed to set up guard pages, and the large stacks needed to amortize its cost, are, I suspect, some of the largest costs of traditional thread creation as compared to heap allocation of activation records.
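The heap-allocated-activation-record idea can be sketched in plain C. This is not real CPS with closures (C has none); it is a toy, with invented names, that shows the essential trade: each "call" becomes a cheap heap-allocated frame pushed onto a continuation chain, and each frame becomes garbage to be reclaimed once applied.

```c
#include <stdlib.h>

/* Heap-allocated "activation record" for the pending work. */
struct cont {
    long n;                   /* pending multiplier */
    struct cont *next;        /* rest of the continuation */
};

/* Factorial without using the C stack for the recursion. */
long fact_cps(long n)
{
    struct cont *k = NULL;
    long acc = 1;

    /* "Recurse": push one heap frame per call instead of one stack frame.
     * Frame creation costs about as much as a function call plus malloc. */
    while (n > 1) {
        struct cont *f = malloc(sizeof(*f));
        if (f == NULL)
            abort();
        f->n = n;
        f->next = k;
        k = f;
        n--;
    }

    /* "Return": run the continuation chain; each applied frame is now
     * garbage and must be freed -- the cost the post is talking about. */
    while (k != NULL) {
        struct cont *f = k;
        acc *= f->n;
        k = f->next;
        free(f);
    }
    return acc;
}
```

A language with first-class continuations does this frame management for you; here the garbage-collection burden the post mentions shows up explicitly as the free() in the second loop.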
Another reason to think of async as light-weight threading is that a workaround for a missing, but needed, async version of a sync function is to create a worker thread to do the sync call in the background and report back to the caller when the work is done. Threads are fairly expensive to create. At the very least, async interfaces make the implementation cost less obvious to the developer and leave more options to the implementor (who might resort to forking a thread per async call if they really want to).
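That worker-thread workaround fits in a few lines of POSIX C. All names here are invented for illustration; the callback-plus-context-pointer pair is the C substitute for the closures discussed above.

```c
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

/* Completion callback: result of the blocking call plus caller context. */
typedef void (*async_cb)(int result, void *ctx);

struct async_req {
    int (*blocking_fn)(void *);   /* the sync call to run */
    void *arg;                    /* its argument */
    async_cb cb;                  /* completion callback */
    void *ctx;                    /* caller context ("baton") */
};

static void *async_worker(void *v)
{
    struct async_req *req = v;
    int result = req->blocking_fn(req->arg);  /* may block for a long time */
    req->cb(result, req->ctx);                /* report back to the caller */
    free(req);
    return NULL;
}

/* Run blocking_fn asynchronously on a helper thread; 0 on success. */
int async_call(int (*blocking_fn)(void *), void *arg, async_cb cb, void *ctx)
{
    pthread_t tid;
    struct async_req *req = malloc(sizeof(*req));

    if (req == NULL)
        return -1;
    req->blocking_fn = blocking_fn;
    req->arg = arg;
    req->cb = cb;
    req->ctx = ctx;
    if (pthread_create(&tid, NULL, async_worker, req) != 0) {
        free(req);
        return -1;
    }
    return pthread_detach(tid);
}

/* Demo/usage: a "blocking" computation and a completion callback. */
static volatile int demo_done, demo_result;
static int demo_compute(void *arg) { (void)arg; return 42; }
static void demo_cb(int r, void *ctx) { (void)ctx; demo_result = r; demo_done = 1; }

int async_demo(void)
{
    if (async_call(demo_compute, NULL, demo_cb, NULL) != 0)
        return -1;
    while (!demo_done)            /* a real caller would go do other work */
        sched_yield();
    return demo_result;
}
```

Note that this is exactly the thread-per-call implementation the paragraph warns is expensive; hiding it behind async_call() at least keeps that cost an implementation detail, which is the point being made.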
Finally, pervasive async programming looks a lot like CPS code, which isn’t exactly pleasant. Too bad continuations haven’t made it as a mainstream high level language feature.
Async, async, everywhere
•May 22, 2007 • Leave a Comment

.safe TLD? Probably a bad idea
•April 10, 2007 • Leave a Comment

F-Secure proposes a .safe TLD.
How would a global registrar be able to vet a request from a bank in, say, Nigeria? You’d think that ccTLD registrars would be in a much better position to see to it that local registrants are vetted according to local regulations, which argues for a .safe.cc.
Why just financial institutions? After they move to .safe, all the other sensitive services left outside .safe will become targets, so why not move all medical providers, shopping sites, etcetera to .safe too?
Confusable bank names occur in the world of brick-and-mortar anyway, and those cause problems on the Internet, so how is .safe to avoid confusables?
And so on. I think this is a bad idea. On the other hand, establishing a precedent that registrars can do better would be good!
C shell pushd/popd on steroids as Korn Shell functions
•April 5, 2007 • 1 Comment

Blogfinger inspires me to post some of my crazy KSH function code. The code below is a heavily hacked version of similar functions that I got from Will several years ago. Someday I should post my partial ASN.1 DER encoder/decoder written in KSH :)
Syntax highlighting courtesy of VIM and its :TOhtml feature.
typeset|grep '^integer cdx$' > /dev/null || integer cdx=0
typeset|grep '^integer cdsx$' > /dev/null || integer cdsx=0

function cdhelp {
    cat <<-"EOF"
Usage: cdinit [<file>]                Initialize directory list and stack.
Usage: cdcl                           Clear directory list and stack.
Usage: cdfind <path> [quiet]          Find a directory, by exact match, in the cd list.
Usage: cdgrep <partial path> [first]  Find matching directories in the cd list.
Usage: cdls                           Show the cd list.  Use this to save your cd list.
Usage: cdrf [-a|--append] [<files>|-] Read in a cd list and replace the current cd list.
Usage: cdsv [<directory paths>]       Save the given or current directory to the cd list
                                      (if none given then CWD).
Usage: cdow <index>                   Overwrite the given entry in the cd list with the CWD.
Usage: cdto <index>|<path>|[+]<partial path>
                                      Change the current directory to an entry from the
                                      cd list, or, if a path is given, change to the
                                      given path and save it to the cd list.
Usage: cdrm [<path>|<index>]          Remove a directory, or the current directory, from
                                      the cd list.
Usage: cdsort                         Sort the cd list (alphanumerically).
Usage: pushd [<directory>]
Usage: popd [<number>]
Usage: rightd [<number>]              (reverse of popd)
Usage: dirs                           (shows dirs in pushd/popd stack)
                                      Directory stack, similar to the C-Shell built-ins
                                      of similar names.
EOF
}

function cdfind {
    typeset p dir
    integer i

    if [[ $# -lt 1 || -z "$1" ]]
    then
        print "Usage: cdfind <path> [quiet]"
        return 5
    fi
    i=0
    p="$1"
    [[ "$1" != /* ]] && p="$PWD/$1"
    for dir in "${cdlist[@]}"
    do
        if [[ "$p" = "$dir" ]]
        then
            if [[ "$2" != quiet ]]
            then
                print "Found at index $i"
                return 0
            fi
            return $i
        fi
        i=i+1
    done
    return 255
}

function cdgrep {
    typeset i s
    integer i=0

    s=1
    if [[ $# -lt 1 || $# -gt 2 || -z "$1" ]]
    then
        print "Usage: cdgrep <partial path> [first]"
    fi
    while [[ i -lt cdx ]]
    do
        if eval "[[ \"${cdlist[i]}\" = *${1}* ]]"
        then
            print "$i ${cdlist[i]}"
            s=0
            [[ "$2" = first ]] && return 0
        fi
        i=i+1
    done
    return $s
}

function cdshow {
    typeset found_at

    cdfind "$PWD" quiet
    found_at=$?
    [[ $found_at -eq 255 ]] && found_at='[unsaved]'
    print "$found_at $PWD"
    return 0
}

function cdto {
    typeset i j dir
    integer i

    if [[ $# -ne 1 || -z "$1" ]]
    then
        print "Usage: cdto <index>|<path>|[+]<partial path>"
    fi
    if [[ "$1" = +([0-9]) ]]
    then
        if cd "${cdlist[$1]}"
        then
            print "$1 ${cdlist[$1]}"
            return 0
        fi
    else
        if [[ "$1" != \+* && -d "$1" ]]
        then
            if cd "$1"
            then
                cdsv > /dev/null
                cdshow
                return 0
            fi
        fi
        cdgrep "${1#\+}" first|read j dir
        if [[ -d "$dir" ]]
        then
            if cd "$dir"
            then
                cdsv > /dev/null
                cdshow
                return 0
            fi
        fi
    fi
    print "Could not cd to $1"
    return 1
}

function cdls {
    integer i=0

    while [[ i -lt cdx ]]
    do
        print $i ${cdlist[i]}
        i=i+1
    done
}

cdcl () {
    cdx=0
    cdsx=0
    unset cdlist
    unset cdstack
    set -A cdlist
    set -A cdstack
}

function cdsv {
    typeset dir current
    integer i

    if [[ "$1" = -h || "$1" = --help ]]
    then
        print "Usage: cdsv [<directory paths>] (if none given then CWD)"
        return 1
    fi

    # Look for $PWD in cdlist[]
    current=""
    if [[ $# -eq 0 ]]
    then
        current="current "
        set -- "$PWD"
    fi
    for dir in "$@"
    do
        cdfind "$dir" quiet
        i=$?
        if [[ $i -ne 255 ]]
        then
            print "The ${current}directory $dir is already in the cdlist ($i)"
            continue
        fi
        #cdlist[${#cdlist[@]}]="$PWD"
        cdlist[cdx]=$PWD
        cdx=cdx+1
    done
}

# overwrite entry
function cdow {
    integer i

    if [[ $# -ne 1 || "$1" != +([0-9]) ]]; then
        print "Usage: cdow index#(see cdls)"
        return 1
    fi
    cdfind "$PWD" quiet
    i=$?
    if [[ "$1" -gt ${#cdlist[@]} ]]
    then
        print "Index is beyond cdlist end ($1 > ${#cdlist[@]})"
        return 1
    fi
    if [[ $i -ne 255 ]]
    then
        print "The current directory is already in the cdlist ($i)"
        return 1
    fi
    cdlist[$1]=$PWD
    return 0
}

function cdrf {
    typeset f status dir spath
    typeset cdx_copy cdlist_copy usage
    integer cdx_copy=0

    usage="Usage: cdrf [-a|--append] [<files>|-]"
    status=1
    set -A cdlist_copy --

    # Options
    while [[ $# -gt 0 && "$1" = -?* ]]
    do
        case "$1" in
        -s|--strip)
            spath=$2
            shift
            ;;
        -a|--append)
            cdx_copy=$cdx
            set -A cdlist_copy -- "${cdlist[@]}"
            ;;
        *)
            print "$usage"
            return 1
            ;;
        esac
        shift
    done

    # Default to reading stdin
    [[ $# -eq 0 ]] && set -- -

    # Process cdlist files
    for f in "$@"
    do
        if [[ "$f" = - ]]
        then
            # STDIN
            sed -e 's/^[0-9]* //' | while read dir
            do
                dir=${dir#$spath}
                [[ "$dir" != /* ]] && dir="$PWD/$dir"
                cdlist_copy[cdx_copy]=${dir}
                cdx_copy=cdx_copy+1
            done
            status=0
            continue
        fi

        # Find the cdlist file
        if [[ ! -f "$f" && ! -f "$HOME/.cdpath.$f" ]]
        then
            print "No such cdpath file $f or $HOME/.cdpath.$f"
            continue
        fi
        [[ ! -f "$f" && -f "$HOME/.cdpath.$f" ]] && f="$HOME/.cdpath.$f"

        # Read the cdlist file
        sed -e 's/^[0-9]* //' "$f" | while read dir
        do
            dir=${dir#$spath}
            [[ "$dir" != /* ]] && dir="$PWD/$dir"
            cdlist_copy[cdx_copy]=${dir}
            cdx_copy=cdx_copy+1
        done
        status=0
    done
    [[ $status -ne 0 ]] && return $status

    # Install new cdlist
    cdcl
    cdx=$cdx_copy
    set -A cdlist -- "${cdlist_copy[@]}"
    return 0
}

function cdsort {
    cdls | sed -e 's/^[0-9]* //' | sort -u | cdrf -
    cdls
}

function cdrm {
    integer i

    if [[ -n "$1" && "$1" != +([0-9]) ]]
    then
        cdfind "${1:-$PWD}" quiet
        i=$?
    elif [[ -n "$1" && "$1" = +([0-9]) ]]
    then
        i=$1
    elif [[ $# -ne 0 ]]
    then
        print "Usage: cdrm <path>|<index>"
        return 1
    else
        cdfind "${1:-$PWD}" quiet
        i=$?
    fi
    if [[ "$i" -eq 255 ]] && return 1
    then
        cdfind "${1:-$PWD}"
    fi
    cdls|grep -v "^${i} "|sed -e 's/^[0-9]* //'|cdrf
    i=$?
    cdls
    return $i
}

function pushd {
    if [[ $# -gt 0 && ! -d "$1" ]]
    then
        print "Can't cd to: $1"
        return 1
    fi
    cdstack[$cdsx]="$PWD"
    if cd "${1:-.}"
    then
        cdsx=cdsx+1
        cdstack[cdsx]="$PWD"
        [[ "$2" = sv || "$2" = save ]] && cdsv
        cdshow
    else
        print "Could not cd to $1"
        return 1
    fi
    return 0
}

function pushdsv {
    pushd "$1" save
}

function popd {
    if [[ $((cdsx-${1:-1})) -lt 0 || $((${#cdstack[@]} - ${1:-1})) -lt 0 ]]
    then
        print "Empty stack or popping too much"
        return 1
    fi
    if [[ ! -d "${cdstack[cdsx-${1:-1}]}" ]]
    then
        print "Can't cd to: ${cdstack[cdsx-${1:-1}]}"
        return 2
    fi
    if cd "${cdstack[cdsx-1]}"
    then
        cdsx=cdsx-1
        cdshow
    else
        print "Could not popd to ${cdstack[cdsx-1]}"
        return 1
    fi
    return 0
}

function dirs {
    integer i

    if [[ ${#cdstack[@]} -eq 0 ]]
    then
        print "Empty stack"
        return 1
    fi
    i=1
    print -n "${cdstack[0]}"
    while [[ $i -le $((cdsx)) && $i -le ${#cdstack[@]} ]]
    do
        print -n " ${cdstack[i]}"
        i=i+1
    done
    if [[ ${#cdstack[@]} -gt $cdsx && $i -lt ${#cdstack[@]} ]]
    then
        print -n " <-> "
        while [[ $i -lt ${#cdstack[@]} ]]
        do
            print -n " ${cdstack[i]}"
            i=i+1
        done
    fi
    print
    return 1
}

function rightd {
    typeset i
    integer i

    if [[ -n "$1" && "$1" != +([0-9]) ]]
    then
        print "Usage: rightd [<number>]"
        return 1
    fi
    i=${1:-1}
    if [[ ${#cdstack[@]} -le $((cdsx+i)) ]]
    then
        print "No directories to the right on the stack"
        return 1
    fi
    if [[ ! -d "${cdstack[cdsx+i]}" ]]
    then
        print "Can't cd to: ${cdstack[cdsx+i]}"
        return 2
    fi
    if cd "${cdstack[cdsx+i]}"
    then
        cdsx=cdsx+i
        cdshow
    else
        print "Could not cd to ${cdstack[cdsx+i]}"
        return 1
    fi
    return 0
}

cdinit () {
    [[ -n "${recd}" ]] && return 0
    cdcl
    if [[ $# -eq 0 && -f ~/lib/cdpaths ]]
    then
        cdrf < ~/lib/cdpaths
    elif [[ $# -eq 1 ]]
    then
        cdrf "$1"
    fi
    cdls
}

vicd () {
    ${EDITOR:-vi} ~/.kshcd
}

recd () {
    typeset recd
    recd=inprogress
    . ~/.kshcd
    recd=""
}

cdinit
Neptune and IPsec
•March 30, 2007 • Leave a Comment

That new NIC of ours rocks. Its best feature: incoming packet classification offload. “What?” you ask? Neptune can route incoming packets to the CPUs most closely associated with the packet flows to which those incoming packets belong — and this means lower latency because of hotter caches. Got it?
This classification works on 5-tuples (or hashes thereof), of course: source and destination addresses, next protocol (e.g., TCP, UDP, SCTP), source and destination port numbers.
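A software analogue of that classification is easy to sketch. This toy (invented names, FNV-1a standing in for whatever hash the hardware actually uses) hashes the 5-tuple and maps it to one of a set of per-CPU receive queues, so every packet of a flow lands on the same CPU:

```c
#include <stddef.h>
#include <stdint.h>

#define NQUEUES 8   /* hypothetical number of per-CPU receive queues */

struct flow5 {
    uint32_t src_addr, dst_addr;   /* IPv4 addresses */
    uint8_t  proto;                /* next protocol: TCP=6, UDP=17, SCTP=132 */
    uint16_t src_port, dst_port;
};

/* FNV-1a: a simple, well-known byte hash; stand-in for the real thing. */
static uint32_t fnv1a(const uint8_t *p, size_t len)
{
    uint32_t h = 2166136261u;

    while (len-- > 0) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Pick a receive queue (and hence a CPU) for a flow's packets. */
unsigned flow_queue(const struct flow5 *f)
{
    uint8_t key[13];

    /* Serialize the tuple so struct padding never enters the hash. */
    key[0] = f->src_addr >> 24; key[1] = f->src_addr >> 16;
    key[2] = f->src_addr >> 8;  key[3] = f->src_addr;
    key[4] = f->dst_addr >> 24; key[5] = f->dst_addr >> 16;
    key[6] = f->dst_addr >> 8;  key[7] = f->dst_addr;
    key[8] = f->proto;
    key[9]  = f->src_port >> 8; key[10] = f->src_port;
    key[11] = f->dst_port >> 8; key[12] = f->dst_port;
    return fnv1a(key, sizeof(key)) % NQUEUES;
}
```

The property that matters for cache warmth is simply that flow_queue() is deterministic: two packets of the same flow always hash to the same queue.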
Curiously absent from the data sheet: IPsec. So I asked and I found out: Neptune can classify just as well by IPsec SA SPI as by plaintext 5-tuples. Of course, so can Solaris, therefore Neptune, Niagara and Solaris fit together well, IPsec or no IPsec.
Excellent.
Google’s J2ME apps
•March 27, 2007 • Leave a Comment

Jonathan Schwartz raves about Google Maps on his Blackberry. I couldn’t agree more. Two weeks ago I depended on it heavily (running on my Samsung SPH900) to get around in southern Florida, where I went on vacation. It found locations, plotted routes, and found gas stations near where I was, all very quickly and easily. Wow. I also use Google’s GMail J2ME application for my personal e-mail (actually, this is the primary reason I use GMail: I get to use it on my cell phone with a wonderful UI).
IPsec APIs
•March 27, 2007 • Leave a Comment

Currently IPsec lacks a notion of APIs. By and large the world of IPsec functions on fairly static system configuration: an authentication and authorization database for the key exchange protocol(s) and a packet-filtering type of policy.
Because networks aren’t static and because static configurations are difficult to manage, in practice we end up with rules that use wildcards (or moral equivalents), which somewhat reduce the security of the overall system: for example, “all systems with a certificate that can be validated to this trust anchor can claim any IP address from the following blocks of addresses.” Add wireless networking, note that IPsec protects packets rather than packet flows, and such rules get weaker as the set of applicable peers grows larger.
Worse, application protocols that could benefit from IPsec but cannot simply say “use IPsec” (that is, most application protocols whose authors would like to rely on IPsec) cannot rely on IPsec at all, since IPsec is a black box to them.
IPsec APIs could help drive the use of IPsec.
Solaris has some IPsec APIs (see previous post) and supports “connection latching” to protect packet flows, not just individual packets. It needs more. Specifically it needs to include support for specifying the desired name of a peer and support for discovering the actual phase 1 peer IDs for a given latched connection.
And we need a standard abstract IPsec API so that more standard Internet protocols can make use of IPsec.
Fortunately the IETF has a working group chartered to work on such an API: the BTNS (Better Than Nothing Security) WG. The name of that WG is a bit unfortunate: it reflects but one intended use of just one of its core work items (unauthenticated IPsec), but it has other work items that are needed to make that core item particularly useful, and those other work items happen to be applicable to IPsec in general: connection latching (protecting entire packet flows, not just individual packets) and IPsec APIs.
On Channel Binding
•March 27, 2007 • Leave a Comment

My Internet-Draft on channel binding is in IETF Last Call. Those of you interested in the topic should go review it. Those who are not aware of this topic but are interested in cryptographic protocols should review it as well. Comments should be sent to ietf at ietf.org and should cc me (my e-mail address is on the document).
So what is channel binding and what’s it for? It’s a way to cryptographically bind end-to-end authentication at the application layer to a secure channel at a lower layer. This cryptographic binding is a way to eliminate MITMs in that secure channel. It is particularly useful to applications that intend to rely on TLS or IPsec for session/transport security.
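The data flow can be shown with a toy sketch (invented names; a simple FNV-style mix stands in for a real keyed MAC, so this is NOT cryptographic): each side mixes the secure channel's unique identifier (its "channel bindings", e.g. something like TLS's tls-unique value) into the application-layer authentication. A MITM terminates two different channels, so the bindings seen by client and server differ and verification fails.

```c
#include <stdint.h>

/* Toy "MAC" over a key and a message.  NOT cryptographic -- it only
 * illustrates which inputs go into the authentication exchange. */
static uint32_t toy_mac(const char *key, const char *data)
{
    uint32_t h = 2166136261u;

    for (; *key != '\0'; key++)  { h ^= (uint8_t)*key;  h *= 16777619u; }
    for (; *data != '\0'; data++) { h ^= (uint8_t)*data; h *= 16777619u; }
    return h;
}

/* Client: prove knowledge of the shared key over *this* channel by
 * MACing the channel's unique bindings. */
uint32_t auth_token(const char *shared_key, const char *channel_bindings)
{
    return toy_mac(shared_key, channel_bindings);
}

/* Server: verify using the bindings of the channel it actually sees.
 * With a MITM the two channels (and thus bindings) differ. */
int auth_verify(const char *shared_key, const char *server_side_bindings,
                uint32_t token)
{
    return auth_token(shared_key, server_side_bindings) == token;
}
```

The design point is that neither side trusts the channel's own authentication; the application-layer credential vouches for the specific channel instance.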
Channel bindings are also stimulating the development of APIs for IPsec and an unauthenticated mode of IPsec. Without such APIs it is very difficult for application protocols to rely on IPsec. See the IETF BTNS Working Group charter page and presentations made at past IETF meetings of the BTNS WG (see IETF proceedings; the latest ones are here).
Interestingly, Solaris already has a modicum of an IPsec API in the form of the IP_SEC_OPT socket option which, incidentally, relies on “connection latching.” Connection latching is described in the ipsecconf(1M) man page and in a BTNS WG Internet-Draft. The IPsec APIs that the BTNS WG is working on amount to adding fields, if you wish, to the IP_SEC_OPT socket option that deal with local and peer node naming (though that’s not necessarily how Solaris will implement such an extension — the C bindings of the proposed API deal in opaque types and constructor/accessor/destructor functions).
Filesystem I18N
•December 15, 2006 • 2 Comments

The FS I18N Problem
So you’re a global company, with users in many countries, speaking many languages and, therefore, using various locales. Your users travel, and they communicate with each other, sharing files in project shares, etcetera.
Their documents can contain non-US-ASCII text just fine, depending on the applications they use (say, StarOffice).
Their filenames, on the other hand, cannot. That’s because in the world of POSIX, filesystems are 8-bit clean (filenames can contain any byte values other than 0x00 (NUL) and 0x2F (‘/’)). “8-bit clean” is foul language in the world of internationalization: it typically means the system doesn’t track what codeset is used for what strings, as it’s all just a bunch of bytes. And that’s exactly how POSIX systems deal with filenames.
If all users use locales with the same codeset then users should never see garbage for filenames. Since Solaris has lots of UTF-8 locales nowadays you can, in fact, have your users all use UTF-8 locales.
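If names are supposed to be UTF-8, a filesystem module (or libc stub) can at least check that cheaply. Here is a minimal sketch of such a validator; it checks byte-sequence structure and the most common overlong forms only, and a production version would also reject surrogates and codepoints above U+10FFFF:

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the byte string is structurally valid UTF-8, else 0. */
int utf8_valid(const uint8_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = s[i];
        size_t n;                       /* continuation bytes expected */

        if (b < 0x80)      { i++; continue; }   /* ASCII */
        else if (b < 0xC2) return 0;    /* continuation byte as lead,
                                           or overlong 2-byte lead */
        else if (b < 0xE0) n = 1;
        else if (b < 0xF0) n = 2;
        else if (b < 0xF5) n = 3;
        else               return 0;    /* 0xF5..0xFF never valid */

        if (i + n >= len)
            return 0;                   /* truncated sequence */
        for (size_t j = 1; j <= n; j++)
            if ((s[i + j] & 0xC0) != 0x80)
                return 0;               /* bad continuation byte */
        i += n + 1;
    }
    return 1;
}
```

A check like this is what an "exclude non-UTF-8 names" option would hang off of: run it in the name-lookup/create path and fail the operation with an error when it returns 0.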
But there is legacy to worry about:
- legacy filesystem content
- legacy clients and servers
- legacy habits
- legacy rendering engines
and the interoperability problems that arise from legacy.
So you can prohibit use of non-UTF-8 locales, and do your best to clean up non-UTF-8 filesystem content. For now this is the best answer.
But it doesn’t get OS engineers off the hook entirely. There are several things that we need to worry about, or that we can do.
The OS Could do More
Where possible we ought to do codeset conversions automatically. That’s harder than it sounds, but not impossible.
And we need to worry about Unicode normalization.
An ASCII art picture seems appropriate right now:
 +----------------------+   +----------+
 |POSIX app process     |   | NFS clnt |
 | (user-land)          |   +----------+
 | -------------------- |        ^
 |      libc stubs      |        |
 |          ^           |        |
 +----------|-----------+--------|-----+
 |          |    kernel-land     v     |
 | +--------v-----------+ +----------+ |
 | |POSIX system calls  | |NFS server| |
 | +--------.-----------+ +-------^--+ |
 |          .                     |    |
 |  +-------v---------------------v-+  |
 |  |              VFS              |  |
 |  | +-------------+    +-------+  |  |
 |  | |     VOP     |<..>| DNLC  |  |  |
 |  | |     fop     |    +-------+  |  |
 |  | +------|------+               |  |
 |  +--------|----------------------+  |
 |           v                         |
 |      FS Modules (ZFS, UFS, ...)     |
 |           |                         |
 |           v                         |
 |      filesystem instance            |
 +-------------------------------------+
Most of the components shown in that picture in most POSIX OSes, Solaris included, are blissfully unaware of codesets, encodings and normalization. Strings representing filesystem object names are certainly not tagged with codeset/encoding information.
NFSv4 [RFC3530] does say “thou shall use UTF-8 for filesystem object names” (paraphrase). But most clients and servers do not enforce this. Legacy NFSv2/3 clients and servers certainly don’t — they never had to.
If we wanted to introduce automatic codeset conversions into this picture we’d have to find boundaries where there is knowledge of what codesets are expected on either side of the boundary. No such boundaries exist in that figure… unless, that is, we define some conventions.
If we declare “thou shall use UTF-8 in the middle” then we can quickly find appropriate boundaries for codeset conversion:
- libc knows what locale is in use in user-land and now would know that UTF-8 is expected by the kernel given a UTF-8-in-the-middle convention, so libc syscall stubs could perform codeset conversions
- NFSv4 clients know that servers should expect UTF-8, and they should know what local applications expect (see previous bullet), so, NFSv4 clients can perform whatever codeset conversions they wish
- NFSv4 servers can enforce use of UTF-8 and, as courtesy, could perform codeset conversions for legacy clients when they know about them (how would they know? via out of band configuration most likely)
- filesystem modules can perform codeset conversions too (e.g., you could declare that /export/foo allows only names encoded in ISO8859-15), or encoding conversions (e.g., NTFS wants UCS-2/UTF-16)
Of course, if you’re not in a UTF-8 locale, codeset conversions will only decrease the amount of garbage the user might see and, more importantly, the amount of garbage the user can create. They won’t get rid of all opportunities for garbage (how does one represent kanji characters in ISO8859-1? right, one does not). For a small improvement users would pay what could be a large performance cost. As long as they don’t create non-UTF-8 names all should be OK… So we at least need an option to exclude non-UTF-8 names from the filesystems.
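Here is, roughly, what the conversion at one of those boundaries could look like, using the standard iconv(3) interface. The function name is invented, the source codeset is passed in explicitly (here a real libc stub would derive it from the locale), and a real implementation would cache the conversion descriptor and grow the output buffer on E2BIG:

```c
#include <iconv.h>
#include <string.h>

/* Convert a NUL-terminated name from `from_codeset` to the UTF-8
 * expected "in the middle".  Returns 0 on success with the result in
 * `out` (of size `outlen`), -1 on failure.  Sketch only. */
int name_to_utf8(const char *from_codeset, const char *name,
                 char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", from_codeset);

    if (cd == (iconv_t)-1)
        return -1;              /* unknown codeset */

    char *in = (char *)name;    /* iconv's prototype is not const-clean */
    size_t inleft = strlen(name);
    char *outp = out;
    size_t outleft = outlen - 1;

    size_t r = iconv(cd, &in, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;              /* unconvertible byte or output too small */
    *outp = '\0';
    return 0;
}
```

The performance worry in the paragraph above is visible here: this is an extra pass over every name on every lookup, plus (in this naive version) an iconv_open/iconv_close pair.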
Finally, Unicode Normalization
Having solved the codeset conversion problem (ha!) we can now look at normalization.
Check this out:
solaris-client% touch /net/macos-server/foo/á
solaris-client% cp -r /net/macos-server/foo /tmp
solaris-client% cat /tmp/foo/á
cat: cannot open /tmp/foo/á
solaris-client%
What happened? Well, I entered a-with-acute in my gnome-terminal and the input method produced the precomposed LATIN SMALL LETTER A WITH ACUTE codepoint (U+00E1). But Mac OS X normalized to NFD; that is, it decomposed this to U+0061 (LATIN SMALL LETTER A) followed by U+0301 (COMBINING ACUTE ACCENT). When I copied that file to Solaris I copied the decomposed name.
And you can see what happens then: I enter a filename that looks identical to the file’s actual name and ought to match it, but in fact it does not!
Here we have Unicode’s ability to represent compositions in more than one equivalent way combining with an 8-bit clean system to punish the user. If the application had been a GUI, with a file selection combo box, then chances are that I wouldn’t notice any problems as long as I clicked on the file I wanted, but let me type its name and things break.
Most operating systems out there, Windows and Solaris included, just-don’t-normalize. Because typical input methods produce precomposed codepoints, no one notices any problems. But Mac OS X does normalize: it normalizes filenames given as inputs to LOOKUP and CREATE operations, and it normalizes to a form (NFD) that is different from that of typical input methods on other operating systems.
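The two spellings really are different byte strings, which a few lines of C make plain. In UTF-8, U+00E1 encodes as 0xC3 0xA1, while the NFD form is ASCII 'a' followed by U+0301 (0xCC 0x81); the canon() function below is a toy that handles only this single pair (real normalization needs the Unicode composition tables):

```c
#include <string.h>

static const char nfc[] = "\xc3\xa1";    /* U+00E1, precomposed á */
static const char nfd[] = "a\xcc\x81";   /* U+0061 U+0301, decomposed á */

/* An 8-bit clean filesystem compares names byte-for-byte... */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* ...so a normalization-insensitive lookup must map both spellings to
 * one canonical form first.  Toy canonicalizer for just this one pair. */
void canon(const char *in, char *out)
{
    size_t o = 0;

    while (*in != '\0') {
        if (in[0] == 'a' &&
            (unsigned char)in[1] == 0xcc &&
            (unsigned char)in[2] == 0x81) {
            out[o++] = '\xc3';          /* compose to U+00E1 */
            out[o++] = '\xa1';
            in += 3;
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
}
```

bytes_equal(nfc, nfd) is false even though both render as "á" — exactly the cat(1) failure in the transcript above; comparing canon()-ized names instead makes the lookup normalization-insensitive.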
So, what to do?
We could take the Mac OS X approach: normalize on LOOKUP and CREATE, possibly to NFC instead of NFD (to better match display capabilities on Solaris renderers).
Or we could choose to be normalization-insensitive on LOOKUP and normalization-preserving on CREATE.
The latter interops best, but is also more expensive. It’s also more correct — we don’t have to worry about applications that do silly things like CREATE then READDIR and look for the thing created. Fortunately we can fast-path processing of ASCII names.
Then again, normalization-insensitivity has some complications: it’s not enough to have primitives for comparing strings without regard to composition. There are places where the system hashes strings, such as the DNLC (directory name lookup cache), and we may not want to normalize entire strings there, as that would involve memory allocation. So we might need a primitive that normalizes strings in small incremental steps, so hash functions can normalize their string inputs without having to allocate memory.
In closing I should point out that these two approaches to dealing with normalization both assume that strings that hit the filesystem are already in UTF-8; that is, to address normalization we must first establish an I18N convention as described above.
Building filesystem servers in user-land for protocols like AFS or NFSv4
•December 15, 2006 • Leave a Comment

A friend of mine who dabbles (to put it lightly) in AFS was telling me this weekend about a problem he’d had to contend with.
AFS, like NFSv4, can multiplex multiple users on the client side onto one connection, but unlike NFS servers, AFS servers [OpenAFS] are typically built to run in user-land. Because of this and because of differences in filesystem semantics between POSIX (which NFS approximates) and AFS, AFS servers typically implement not only the server but the whole filesystem in user-land, completely avoiding the underlying OS’ own filesystems.
So one cannot normally share non-AFS filesystems with AFS, and one cannot use AFS filesystems except through the AFS protocol. But my friend Jeff Hutzelman needed to do just that, so he built a prototype to demonstrate that it was possible. [UPDATE: an HTTP URL for hostafs http://www.cs.cmu.edu/afs/cs.cmu.edu/project/systems-jhutz/hostafs/]
This reminded me of what used to be a pet project of mine for Solaris that has not panned out. There is a class of applications for which a feature of Windows and some other OSes, per-thread credentials, can be very useful. Namely, multi-threaded server applications running in user-land that need to perform certain operations as though they were performed by a client user process, not the server process. File servers like OpenAFS are a good example of this. Now, OpenAFS doesn’t actually need such a feature: its filesystem is implemented in user-land as well, so there’s no need for the kernel to provide the server with such a facility.
Well, my team had what I thought was a use case for this facility in Solaris, and so I tried to get a project going to add per-thread credentials support to Solaris. Long story short: there are too many little and big semantics and backwards compatibility problems to resolve, so we had to find an alternative solution. Of course, alternatives existed, and we settled for one: fork() separate worker processes, one per-client user, from a privileged one, adopt the credentials of the client users in the child processes, and run the operations in question in the worker processes.
Now, I’m blogging this a) because I think the problems we ran into with per-thread credentials are interesting, and b) to explore the alternatives and design patterns.
I may blog about (a) in-depth in the future. For now suffice it to say that the problems range from “what credentials will signal handlers, shared object .init sections, etc… run with?” to “what happens if a library, unbeknownst to the application, creates threads from a thread running with different credentials than the process itself, but the library also doesn’t know about per-thread credentials?”
Some of these have relatively simple answers, others cannot be resolved with ease or at all. When you mix multiple security contexts in one VM address space you have to be very careful. In particular you have to make sure that every component of the environment knows about this and follows agreed upon protocols. Multi-user OS kernels like Solaris and Linux are examples of programs that work like that (but then, they’re not user-land programs).
But the world of multi-threaded Unix user-land has always assumed a single credential shared by all threads in a process (setuid(2) affects all threads in the caller’s process, though it does not impact in-progress system calls). Violating this assumption could cause lots of security bugs in code written while that assumption held, and the system is too open and too large to make violating it safe (except for the simplest of programs). As the enormity of the problem of making this safe for our use case sank in, we simply had to settle for more orthodox alternatives.
As for (b), I’ve written a small prototype of a library that implements: a worker process abstraction and variants of system calls that take an argument specifying a user to run the system call as (actually, the prototype does this only for open(2): doas_open()).
Additionally, for the fun of it and to explore some of the problems in (a), I have an LD_PRELOADable shared object that implements an illusion of per-thread credentials.
This sort of facility (e.g., doas_open()) could be very useful in building multi-threaded fileservers for protocols like AFS and NFSv4, running in user-land but sharing filesystems implemented by the OS, rather than filesystems implemented in the same user-land program: such programs need to be able to run system calls like open(2) with the credentials that correspond to the clients on behalf of which the server calls open(2). This prototype provides just that, albeit using fork()ed worker processes under the hood (but the IPC mechanism used is doors, which is very fast). (Note that I/O system calls like read(2), write(2), getmsg(2), putmsg(2), recvmsg(2), pread(2), pwrite(2), and so on don’t need to be proxied through worker processes: what matters is the credential used to create the file descriptor in question, not the credentials of the caller of these I/O calls.)
One of the limitations of my friend’s prototype AFS server is that it can’t perform local filesystem operations as the clients. Perhaps my prototype can help.
The library consists of two main functions:

- doas_alloc(), which returns a handle to a worker process and takes a description of the user’s credentials as an argument
- doas_free(), which releases handles created by doas_alloc()

and then doas_open(), which is just like open(2), but augmented with a worker process handle argument.
Actually, the handle represents a user credential, not a worker process — that a worker process is used is an implementation detail.
The interface for the per-thread credential emulator (which requires interposing on open(2) and friends) consists of a single additional function: doas(), which takes a handle, a callback function, and a data argument. Calls to open(2) and friends in the callback function, and in all functions it calls in the same thread, will be changed to calls to doas_open() with the handle passed to doas().
So there you have it. A framework for building function wrappers that take as an additional argument a “user impersonation token” (to borrow from Windows terminology), portable to any Unix that has IPC methods that allow for file descriptor passing. And a framework for emulating per-thread credentials, written with zero extensions to the underlying OS, portable to any Unix whose linker/run-time linker supports interposition.
The API looks like this:
int doas_open(int idx, char *fname, int oflag, mode_t mode);

/* get current doas subject index */
int doas_current(void);

/*
 * get a new subject index for a new subject represented by the setup proc and
 * setup_data (which are called to setup a worker process' environment)
 */
int doas_idx_new(int (*setup_proc)(void *), void *setup_data);

/* release a subject index */
void doas_done(int idx);

/*
 * arrange for interposers called by this same thread to proxy syscalls to the
 * worker process associated with the given subject index
 */
int doas(int idx, int (*doas_func)(void *), void *doas_data);
It could be improved somewhat. For example, the handles shouldn’t be int
but int64_t, and the library should make sure no handles are reused in the same process so it can detect dangling references to dead handles.
The code for forking worker processes is interesting to me because it might be a re-usable design. The parent process creates a door, forks a child, and then the child creates its door and sends it to the parent via a door call on the parent’s door. That may seem like a complicated dance, but it isn’t really, and I believe that the asymmetry it leads to is central to the doors IPC story.
The door server forker consists of:
- A rendez-vous point — the function that fork()s the child has to wait for the child to call the parent’s door; the parent’s door server function will run in a separate thread, so a simple condition variable will do
- A door server function in the parent
- A function to setup the rendez-vous, start the parent door server, fork() the child and wait for it, create the child’s door, pass it back to the parent and enter door_return(3DOOR)
- Setting the child worker process’ credentials is done through a callback in this prototype, but it could be done with an octet-string token describing the credentials to use
The rendez-vous and the function that forks the workers look like this (pardon the printf()s):
struct doas_fork_rendez_vous {
	pthread_cond_t cv;
	pthread_mutex_t lock;
	int idx;
	int dfd;
	int this_dfd;
};

static
int
fork_doas_door(int idx, int (*setup_proc)(void *), void *setup_data)
{
	pid_t pid;
	int dfd;
	struct doas_fork_rendez_vous rv;

	if (idx >= ndoors)
		return (-1);
	if (pthread_cond_init(&rv.cv, NULL) != 0)
		return (-1);
	if (pthread_mutex_init(&rv.lock, NULL) != 0) {
		(void) pthread_cond_destroy(&rv.cv);
		return (-1);
	}
	(void) pthread_mutex_lock(&rv.lock);
	rv.idx = idx;
	rv.dfd = -1;
	dfd = door_create(doas_fork_parent, (void *)&rv, DOOR_UNREF);
	rv.this_dfd = dfd;
	if (dfd < 0) {
		(void) pthread_mutex_unlock(&rv.lock);
		(void) pthread_cond_destroy(&rv.cv);
		(void) pthread_mutex_destroy(&rv.lock);
		return (-1);
	}
	if ((pid = fork()) < 0) {
		(void) pthread_mutex_unlock(&rv.lock);
		(void) pthread_cond_destroy(&rv.cv);
		(void) pthread_mutex_destroy(&rv.lock);
		(void) door_revoke(dfd);
		return (-1);
	}
	if (pid == 0) {
		int dfd2 = -1;
		int failed = 0;
		door_arg_t darg;
		door_desc_t dd;

		(void) memset(&darg, 0, sizeof (darg));
		darg.data_ptr = NULL;
		darg.data_size = 0;

		/* child */
		dfd2 = door_create(doas_proc, NULL, DOOR_UNREF);

		/*
		 * If setup_proc == NULL we're setting up a worker
		 * process with the same characteristics as the parent,
		 * e.g., so the parent can drop privileges but retain
		 * a door to this privileged worker process.
		 */
		if (setup_proc != NULL && setup_proc(setup_data) < 0) {
			/* failure */
			failed = 1;
		} else {
			dd.d_attributes = DOOR_DESCRIPTOR;
			dd.d_data.d_desc.d_descriptor = dfd2;
			darg.desc_ptr = &dd;
			darg.desc_num = 1;
		}
		if (door_call(dfd, &darg) < 0) {
			(void) door_revoke(dfd2);
			exit(1);
		}
		if (failed) {
			(void) door_revoke(dfd2);
			exit(1);
		}
		/* XXX should rv.cv/rv.lock be cleaned up in the child?? */
		/* Service the door */
		(void) door_return(NULL, 0, NULL, 0);
		exit(1); /* shouldn't happen */
		/* NOTREACHED */
	}

	/* parent -- wait for child to pass back its door */
	printf("Parent is going to sleep on cv\n");
	(void) pthread_cond_wait(&rv.cv, &rv.lock);
	printf("Parent is back from sleeping on cv\n");

	/* we no longer need the door over which the child passed its door */
	(void) door_revoke(dfd);

	/* cleanup */
	(void) pthread_mutex_unlock(&rv.lock);
	(void) pthread_cond_destroy(&rv.cv);
	(void) pthread_mutex_destroy(&rv.lock);

	/* save the child's door */
	return (doors[idx] = rv.dfd);
}
The door server function in the parent looks like this:
/* ARGSUSED */
static void
doas_fork_parent(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_desc)
{
	struct doas_fork_rendez_vous *rv;

	rv = (struct doas_fork_rendez_vous *)cookie;
	if (arg_size == 0 && argp == DOOR_UNREF_DATA) {
		printf("doas_fork_parent() unref\n");
		(void) door_revoke(rv->this_dfd);
		(void) door_return(NULL, 0, NULL, 0);
	}
	if (n_desc == 1)
		rv->dfd = dp->d_data.d_desc.d_descriptor;

	/* wake up the parent */
	printf("doas_fork_parent() here!\n");
	(void) pthread_mutex_lock(&rv->lock);
	(void) pthread_cond_broadcast(&rv->cv);
	(void) pthread_mutex_unlock(&rv->lock);
	printf("doas_fork_parent() cond_broadcast done...!\n");
	(void) door_return(NULL, 0, NULL, 0);
}
The rest is left as an exercise for the reader. Though if this really is useful I can arrange to post the whole thing (probably under the CDDL).
Blackboxes as data center LEGO bricks
October 19, 2006

So, we have a small data center in a container — 1 TEU. Great! Now we can build a datacenter with components that resemble LEGO bricks. Or can we?
The Blackbox needs power. And it needs [water] cooling. And then there’s backups to think about.
So one might wonder: what might be the ratio of power generation, cooling and fuel bricks to compute bricks?
Brian Utterback thinks we can get about 1000kVA in a TEU. Assuming a 0.5 power factor, and 250 systems at 300W each in a Blackbox, that yields about 5-6 Blackboxes per TEU of power generation.
[UPDATE: Generators come in much denser packages. For example, this Kohler 12 RES generator (specs) takes just over 9 sq. ft. of space.]
One liter of Diesel holds about 10 kilowatt-hours (says Wikipedia). A TEU is about 39,000 liters. Adjusting for the inefficiency of the generator that would burn this Diesel — call it ~35%, or roughly 3.5kWh of electricity per liter — that’s about 450 hours of operation at 300kW. So, ~1 container’s worth of Diesel for every two to three weeks, per-Blackbox.
Brian also thinks that a 150-ton chiller can fit in 1TEU. Blackbox needs 60 tons of cooling, so we’re looking at one TEU of cooling for every 2.5 Blackboxes. Does the generator need cooling? Or does it cool itself? Gotta find out.
[UPDATE: Carrier has a 600 ton cooling unit in 1 TEU! So the ratios below are all wrong. The true ratios are much better than that. HT: John Hoffman.]
Of course, if you can pump water from a large body of water then you may be able to avoid cooling units altogether. I wonder what quality of water the Blackbox needs. Will river water do? Anyhow, one might be able to cool many compute bricks with much less than 1TEU of pump/filtration equipment.
So, at minimum one would need, rounding up: three TEUs (one blackbox, one generator, one diesel tank), if you backup over the network.

Ratios: 1 generator TEU and 1 fuel tank TEU for every 6 compute TEUs. If you can’t get cold water then you need to add 1-2 TEUs of cooling and drop to 4 compute TEUs. Optionally add a tape library TEU and subtract a compute TEU. The result: a complete data center in 8 TEUs, 3-6 of which can be compute TEUs, depending on whether you have access to water and whether you need tape backups.
[UPDATE: So, actually, we have 1 TEU of cooling for every 10 TEUs of compute/storage, and better for power. So we’re talking about 3 TEUs rounded up for power, fuel and cooling for the first ten compute/storage bricks. Wow. And you can put these almost anywhere.]
Speaking of backups… How about backup to disk, using a Blackbox filled with Thumpers? Never ship tapes off-site. Backup to redundant VTLs made of Thumper Blackboxes and, when they fill up, add more — and you can always put one of these on a truck and ship it for safe storage off-site. It seems likely that securing the transportation of a Blackbox should be easier than securing the transportation of a box of tapes, too. We used to say “never underestimate the bandwidth of a van full of tapes.” Now we might say “never underestimate the bandwidth of a container ship full of Blackboxes.”