ZFS and the automounter

April 25, 2006

So, you have ZFS, and you create lots and lots of filesystems, all the
time, any time you like.

Problem: new filesystems don’t show up in /net (the -hosts special
automount map) once a host’s /net entry has been automounted.

Solution #1: wait for OpenSolaris to gain support for client-side
crossing of NFSv4 server-side mount-points.

Solution #2: replace the -hosts map with an executable automount map that
generates hierarchical automount map entries based on showmount -e output.

The script is pretty simple, but there is one problem: the automounter
barfs at entries longer than 4095 characters (sigh).

#!/bin/ksh
#
# Executable automount map: automountd invokes this with a hostname as
# its sole argument; we print one hierarchical map entry covering every
# filesystem that host exports.
exec 2>/dev/null

if [[ $# -eq 0 ]]
then
	logger -p daemon.error -t "${0##*/}" "Incorrect usage by automountd!"
	exit 1
fi

# Start at the root of the server's namespace...
entry="/ $1:/"

# ...then add one offset per exported filesystem, in sorted order.
# (In ksh the last stage of a pipeline runs in the current shell, so
# the assignments made inside the loop survive it.)
showmount -e "$1" | sort | grep '^/' | while read p junk
do
	entry="$entry $p $1:$p"
done

print "$entry"
logger -p daemon.debug -t "${0##*/}" "$entry"
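
To put the script to use (a hedged example; the path /etc/auto_net_exec
and the option list are mine, not from the original post), install it
as an executable file and name it by absolute path in auto_master in
place of the -hosts map:

# /etc/auto_master: an executable map replacing -hosts for /net
/net	/etc/auto_net_exec	-nosuid,nobrowse

You can also run it by hand, as in /etc/auto_net_exec somehost, to see
the single hierarchical entry it prints for that host.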

Comparing ZFS to the 4.4BSD LFS

December 7, 2005

Remember the 4.4BSD Log Structured Filesystem? I do. I’ve been thinking
about how the two compare recently. Let’s take a look.

First, let’s recap how LFS works. Open your trusty copy of “The
Design and Implementation of the 4.4BSD Operating System” to chapter
8, section 3, or go look at the Wikipedia entry for LFS, or read any
of the LFS papers referenced therein. Go ahead, take your time, this
blog entry will wait for you.

As you can see, LFS is organized on disk as a sequence of large
segments, each segment consisting of smaller chunks, the whole forming
a log of sorts. Each small chunk consists of the data and meta-data
blocks that were modified or added at the time the chunk was written.
LFS, like ZFS, never overwrites data (excepting superblocks), so LFS
is, by definition, copy-on-write, just like ZFS. In order to be able
to find inodes whose block locations have changed, LFS maintains a
mapping of inode numbers to block addresses in a regular file called
the “ifile.”

That should be enough of a recap. As for ZFS, I assume the reader has
been reading the same ZFS blog entries I have been reading. (By the way,
I’m not a ZFS developer. I only looked at ZFS source code for the first
time two days ago.)

So, let’s compare the two filesystems, starting with generic aspects of
transactional filesystems:

  • LFS writes data and meta-data blocks every time it needs to
    fsync()/sync()
  • Whereas ZFS need only write data blocks and entries in its intent log
    (ZIL)

This is very important. The ZIL is a compression technique that allows
ZFS to safely defer many writes that LFS could not. Most LFS meta-data
writes are very redundant, after all: writing to one block of a file
implies writing new indirect blocks, a new inode, a new data block for
the ifile, new indirect blocks for the ifile, and a new ifile inode;
yet all of these writes are easily summarized as “wrote block #X to
replace block #Y of the file whose inode # is Z.”

Of course, ZFS can’t ward off meta-data block writes forever, but it can
safely defer them with its ZIL, and in the process stands a good chance
of being able to coalesce related ZIL entries.
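
To make that concrete, here is a hedged sketch, in C, of the kind of
record an intent log could use to summarize such a write; the struct
and field names are mine and are not taken from the ZFS source:

/*
 * One small log record stands in for the whole chain of meta-data
 * writes, because it names the object and the range affected; the
 * indirect-block, inode/dnode and ifile-style updates it implies can
 * be reconstructed (or coalesced) later.
 */
#include <stdint.h>

typedef struct intent_log_write {
	uint64_t object;  /* dnode (znode) number of the file written */
	uint64_t offset;  /* byte offset of the modified range */
	uint64_t length;  /* number of bytes written */
	uint64_t blkaddr; /* where the new data block landed */
} intent_log_write_t;

LFS, by contrast, must write the data and all of those meta-data
blocks at fsync()/sync() time.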

  • LFS needs copying garbage collection, a process that involves both
    searching for garbage and relocating the live data around it
  • Whereas where LFS sees garbage, ZFS sees snapshots and clones

The Wikipedia LFS entry says this about LFS’ lack of snapshots and
clones: “LFS does not allow snapshotting or versioning, even though both
features are trivial to implement in general on log-structured file
systems.” I’m not sure I’d say “trivial,” but it’s certainly easier
than in more traditional filesystems.

  • LFS has an ifile to track inode number to block address mappings. It
    has to, else how to COW inodes?
  • ZFS has a meta-dnode; all objects in ZFS are modelled as a “dnode,”
    so dnodes underlie inodes, or rather, “znodes,” and znode numbers
    are dnode numbers. Dnode-number-to-block-address mappings are kept
    in a ZFS filesystem’s object set’s meta-dnode, much as
    inode-number-to-block-address mappings are kept in the LFS ifile.

It’s worth noting that ZFS uses dnodes for many purposes besides
implementing regular files and directories; some such uses do not
require many of the meta-data items associated with regular files and
directories, such as ctime/mtime/atime, and so on. Thus “everything is a
dnode” is more space efficient than “everything is a file” (in LFS the
“ifile” is a regular file).
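
To restate the parallel in code, here is a hedged sketch (the type and
field names are mine, not from either filesystem's source) of the
question both structures answer: given an inode or dnode number, where
does its current, most recently copied-on-write copy live?

#include <stdint.h>

typedef uint64_t blkaddr_t;

/* LFS: the ifile is a regular file; entry i holds inode i's address. */
typedef struct ifile_entry {
	blkaddr_t if_daddr;	/* current disk address of inode i */
} ifile_entry_t;

/*
 * ZFS: the meta-dnode is itself a dnode whose contents are all of an
 * object set's dnodes, so dnode number n is simply the n'th entry.
 */
typedef struct dnode_sketch {
	uint8_t   dn_type;	/* what kind of object this is */
	uint8_t   dn_nlevels;	/* depth of its block-pointer tree */
	blkaddr_t dn_blkptr[3];	/* roots of that tree */
} dnode_sketch_t;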

  • LFS does COW (copy-on-write)
  • ZFS does COW too

But:

  • LFS stops there
  • Whereas ZFS uses its COW nature to improve on RAID-5 by avoiding the
    need to read-modify-write, so RAID-Z goes faster than RAID-5 (see
    the sketch below). Of course, ZFS also integrates volume management.
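
Here is a hedged sketch of the arithmetic behind that claim (the
function names are mine). A RAID-5 small write must first read the old
data and old parity back from the disks to compute the new parity,
while a COW layout that always writes full (variable-width) stripes
can compute parity purely from data it already has in hand:

#include <stdint.h>
#include <stddef.h>

/* RAID-5 small write: two reads, then two writes. */
uint64_t
raid5_new_parity(uint64_t old_data, uint64_t old_parity, uint64_t new_data)
{
	return (old_parity ^ old_data ^ new_data);
}

/* Full-stripe write: parity comes straight from the new data. */
uint64_t
full_stripe_parity(const uint64_t *data, size_t ndisks)
{
	uint64_t p = 0;

	for (size_t i = 0; i < ndisks; i++)
		p ^= data[i];
	return (p);
}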

Besides features associated with transactional filesystems, and besides
volume management, ZFS provides many other features, such as:

  • data and meta-data checksums to protect against bitrot
  • extended attributes
  • NFSv4-style ACLs
  • ease of use
  • and so on

OpenSolaris Opening Day — Where’s ssh?

June 10, 2005

Yay! OpenSolaris is out. The amount of work, energy and dedication that the OpenSolaris team has put into the project is fantastic. They got 132 engineers to write 215,000 words’ worth of blog entries for opening day (not enough, I suppose, given how many engineers we have, but that’s a large novel’s worth of postings written in a few short weeks; readers will be busy digesting all that material!). Crucial to this effort was easy blogging tech; blogging is here to stay, and you’ll only see more of it.

O.K., but, where’s the Solaris Secure Shell? Ah, well, it will be there. It’s not there now, and that’s partly my fault: I didn’t notice that it would be missing, or why, until it was almost too late, and my initial stab at a build-system fix that would have allowed ssh to be in OpenSolaris was incomplete, so I missed the deadline. Please accept my apologies, readers and OpenSolaris team alike.

For more info on OpenSolaris and missing crypto code, see Darren’s blog entry on the topic.

An Ode to the Linker

June 10, 2005

I’m no poet, but…

The Solaris linker is one of my favorite parts of Solaris. Long before coming to Sun I acquainted myself with the excellent Linker and Libraries Guide at docs.sun.com and used various linker tricks for fun and profit.

I see the linker as a set of extensions to the C language:

  • The control over the namespace afforded by object grouping amounts to a form of packages, in the Lisp or Perl5 sense.
  • The dlopen(3C) family of functions provides for run-time extensibility.
  • Symbol interposition provides a rudimentary form of “around methods,” if you’ll allow the abuse of terminology (see the sketch after this list).
  • Filters provide a facility for coping with moves of interfaces from one object to another, and more.
  • Map files, together with header files, provide more control over the definition of an object’s ABI than headers alone.
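
As an illustration of the interposition point above, here is a hedged sketch of an “around method” for getenv(3C); the file name and build lines are mine, not something prescribed by the guide:

/* interpose.c: log every getenv() call, then delegate to libc. */
#define _GNU_SOURCE	/* for RTLD_NEXT on some systems */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

char *
getenv(const char *name)
{
	static char *(*real_getenv)(const char *);

	if (real_getenv == NULL)
		real_getenv = (char *(*)(const char *))dlsym(RTLD_NEXT,
		    "getenv");

	(void) fprintf(stderr, "getenv(\"%s\")\n", name);
	return (real_getenv(name));
}

Built as a shared object (with Sun cc, something like cc -Kpic -G -o interpose.so interpose.c; with gcc, gcc -fPIC -shared -o interpose.so interpose.c) and activated with LD_PRELOAD=./interpose.so, it wraps every getenv() call made by the target process.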

Developers would be well advised to read the Solaris Linker and Libraries Guide, as well as Rod Evans’ and Michael Walker’s blogs.

Introduction

June 8, 2005

Since coming to Sun, almost three years ago, I’ve worked on the Solaris Secure Shell (ssh) and, to a lesser degree, on Solaris’ implementation of Kerberos V. Now I’m working on PAM and GSS-API issues, some involving standardization efforts at the IETF, particularly at the KITTEN Working Group.

I should have been blogging quite a lot by now, but for one reason or another I have not.
The advent of OpenSolaris will see me blogging more, I’m sure. Subject matter for blog entries does abound; I could blog on ssh, Kerberos, the GSS-API, the IETF, PAM, Least Privilege, and much more. Suggestions are welcome.