Filesystem I18N

The FS I18N Problem

So you’re a global company, with users in many countries, speaking many languages and, therefore, using various locales. Your users travel, and they communicate with each other, sharing files in project shares, etcetera.

Their documents can contain non-US-ASCII text just fine, depending on the applications they use (say, StarOffice).

Their filenames, on the other hand, cannot, at least not reliably. That’s because in the world of POSIX, filesystems are 8-bit clean: filenames can contain any byte values other than 0x00 (NUL) and 0x2F (‘/’). “8-bit clean” is foul language in the world of internationalization: it typically means the system doesn’t track which codeset is used for which strings, as it’s all just a bunch of bytes. And that’s exactly how POSIX systems deal with filenames.
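To make the “bunch of bytes” point concrete, here is a small sketch of what happens when a name written under one codeset is read under another. The filename and the helper `decode_for_display` are made up for illustration; nothing here is Solaris-specific.

```python
# The filesystem stores only bytes; the reader's locale decides what they mean.
name = "résumé.txt"
utf8_bytes = name.encode("utf-8")          # written by a user in a UTF-8 locale
latin1_bytes = name.encode("iso-8859-1")   # written by a user in an ISO8859-1 locale


def decode_for_display(raw: bytes, codeset: str) -> str:
    """Decode filename bytes the way a user of `codeset` would see them.

    Hypothetical helper: undecodable bytes are shown as U+FFFD.
    """
    return raw.decode(codeset, errors="replace")


# A UTF-8 name seen from an ISO8859-1 locale: every byte decodes, wrongly.
print(decode_for_display(utf8_bytes, "iso-8859-1"))

# An ISO8859-1 name seen from a UTF-8 locale: 0xE9 is not valid UTF-8 here.
print(decode_for_display(latin1_bytes, "utf-8"))
```

Both names are perfectly legal to the filesystem, since neither contains 0x00 or 0x2F. Only the readers disagree about what they say.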

If all users use locales with the same codeset then users should never see garbage for filenames. Since Solaris has lots of UTF-8 locales nowadays you can, in fact, have your users all use UTF-8 locales.

But there is legacy to worry about:

  • legacy filesystem content
  • legacy clients and servers
  • legacy habits
  • legacy rendering engines

and the interoperability problems that arise from legacy.

So you can prohibit use of non-UTF-8 locales, and do your best to clean up non-UTF-8 filesystem content. For now this is the best answer.

But it doesn’t get OS engineers off the hook entirely. There are several things that we need to worry about, or that we can do.

The OS Could Do More

Where possible we ought to do codeset conversions automatically. That’s harder than it sounds, but not impossible.

And we need to worry about Unicode normalization.

An ASCII art picture seems appropriate right now:

+----------------------+   +----------+
|POSIX app process     |   | NFS clnt |
|  (user-land)         |   +----------+
| -------------------- |         ^
| libc stubs           |         |
|       ^              |         |
|       |              |         |
|       |   kernel-land          v    |
|+------|-------------+  +----------+ |
||POSIX . system calls|  |NFS server| |
|+------.-------------+  +-------^--+ |
|       .                        |    |
|       .                        |    |
| +-----v------------------------v-+  |
| |                VFS             |  |
| | +-------------+     +-------+  |  |
| | |    VOP      |     |DNLC   |  |  |
| | |-------------|<...>|   ^   |  |  |
| | |    fop      |     +---|---+  |  |
| | +-------------+         |      |  |
| +-------------------------|------+  |
|                ^          v         |
|    FS Modules  | (ZFS, UFS, ...)    |
|                |                    |
|                v                    |
|    filesystem instance              |

Most of the components shown in that picture in most POSIX OSes, Solaris included, are blissfully unaware of codesets, encodings and normalization. Strings representing filesystem object names are certainly not tagged with codeset/encoding information.

NFSv4 [RFC3530] does say “thou shall use UTF-8 for filesystem object names” (paraphrase). But most clients and servers do not enforce this. Legacy NFSv2/3 clients and servers certainly don’t — they never had to.

If we wanted to introduce automatic codeset conversions into this picture we’d have to find boundaries where there is knowledge of what codesets are expected on either side of the boundary. No such boundaries exist in that figure… unless, that is, we define some conventions.

If we declare “thou shall use UTF-8 in the middle” then we can quickly find appropriate boundaries for codeset conversion:

  • libc knows what locale is in use in user-land and now would know that UTF-8 is expected by the kernel given a UTF-8-in-the-middle convention, so libc syscall stubs could perform codeset conversions
  • NFSv4 clients know that servers should expect UTF-8, and they should know what local applications expect (see previous bullet), so, NFSv4 clients can perform whatever codeset conversions they wish
  • NFSv4 servers can enforce use of UTF-8 and, as courtesy, could perform codeset conversions for legacy clients when they know about them (how would they know? via out of band configuration most likely)
  • filesystem modules can perform codeset conversions too (e.g., you could declare that /export/foo allows only names encoded in ISO8859-15), or encoding conversions (e.g., NTFS wants UCS-2/UTF-16)
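The first bullet can be sketched roughly as follows. This is only an illustration in Python of what a converting syscall stub would do at the user/kernel boundary, assuming a UTF-8-in-the-middle convention; the function names `to_kernel_name` and `from_kernel_name` are hypothetical, and a real libc would use something like iconv rather than Python codecs.

```python
import codecs


def _is_utf8(codeset: str) -> bool:
    # codecs.lookup() resolves aliases such as "UTF-8", "utf8", "U8".
    return codecs.lookup(codeset).name == "utf-8"


def to_kernel_name(raw: bytes, locale_codeset: str) -> bytes:
    """Convert a name from the caller's locale codeset to UTF-8.

    Raises UnicodeDecodeError if `raw` is not valid in `locale_codeset`.
    """
    if _is_utf8(locale_codeset):
        raw.decode("utf-8")  # nothing to convert, but still validate
        return raw
    return raw.decode(locale_codeset).encode("utf-8")


def from_kernel_name(utf8_name: bytes, locale_codeset: str) -> bytes:
    """Convert a UTF-8 name coming back from the kernel (e.g. readdir).

    Raises UnicodeEncodeError if the name has no representation in
    `locale_codeset`.
    """
    if _is_utf8(locale_codeset):
        return utf8_name
    return utf8_name.decode("utf-8").encode(locale_codeset)


# An ISO8859-15 user creates a file named with the euro sign (byte 0xA4).
on_disk = to_kernel_name(b"\xa4.txt", "iso8859-15")
print(on_disk)                                  # UTF-8 bytes for "€.txt"
print(from_kernel_name(on_disk, "iso8859-15"))  # back to the user's bytes
```

The same pair of conversions would apply at the other boundaries in the list (NFSv4 client, NFSv4 server, filesystem module); only the source of the “other side’s” codeset differs.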

Of course, if you’re not in a UTF-8 locale, codeset conversions will only decrease the amount of garbage the user might see and, more importantly, the amount of garbage the user can create. They won’t get rid of all opportunities for garbage (how does one represent kanji characters in ISO8859-1? Right, one does not). For a small improvement users would pay what could be a large performance cost. As long as they don’t create non-UTF-8 names all should be OK… So we at least need an option to exclude non-UTF-8 names from the filesystems.
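A minimal sketch of both halves of that paragraph: a check that a filesystem (or NFSv4 server) could apply to exclude non-UTF-8 names, and a check for whether a name can be shown at all in a given legacy codeset. The helper names `is_utf8_name` and `representable` are invented for this example.

```python
def is_utf8_name(raw: bytes) -> bool:
    """True if `raw` is an acceptable name under a UTF-8-only policy."""
    if not raw or b"\0" in raw or b"/" in raw:
        return False  # the usual POSIX restrictions still apply
    try:
        raw.decode("utf-8")  # strict: rejects stray and truncated sequences
    except UnicodeDecodeError:
        return False
    return True


def representable(name: str, codeset: str) -> bool:
    """True if every character of `name` exists in `codeset`."""
    try:
        name.encode(codeset)
    except UnicodeEncodeError:
        return False
    return True


print(is_utf8_name("漢字.txt".encode("utf-8")))  # a UTF-8 name passes
print(is_utf8_name(b"caf\xe9"))                  # an ISO8859-1 name does not
print(representable("漢字.txt", "iso8859-1"))    # kanji in ISO8859-1: no
```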

Finally, Unicode Normalization

Having solved the codeset conversion problem (ha!) we can now look at normalization.

Check this out:

solaris-client% touch /net/macos-server/foo/á
solaris-client% cp -r /net/macos-server/foo /tmp
solaris-client% cat /tmp/foo/á
cat: cannot open /tmp/foo/á

What happened? Well, I entered a-with-acute in my gnome-terminal and the input method produced the composed LATIN SMALL LETTER A WITH ACUTE codepoint (U+00E1). But Mac OS X normalized to NFD; that is, it decomposed this to U+0061 (ASCII ‘a’) followed by U+0301 (COMBINING ACUTE ACCENT). When I copied that file to Solaris I copied the decomposed name.

And you can see what happens then: I enter a filename that I think ought to match the file’s actual name, and looks like it does, and should match it, but in fact does not!
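The mismatch is easy to reproduce without a Mac, using Python’s standard `unicodedata` module to stand in for the two sides. The variable names are just for illustration.

```python
import unicodedata

typed = "\u00e1"                               # what the input method produced
stored = unicodedata.normalize("NFD", typed)   # what an NFD-normalizing server keeps

print([f"U+{ord(c):04X}" for c in typed])      # one precomposed codepoint
print([f"U+{ord(c):04X}" for c in stored])     # 'a' plus a combining accent

# An 8-bit clean system compares the UTF-8 bytes, and those differ.
print(typed.encode("utf-8"), stored.encode("utf-8"))
print(typed.encode("utf-8") == stored.encode("utf-8"))

# Comparing after normalizing both sides to the same form does match.
print(unicodedata.normalize("NFC", stored) == typed)
```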

Here we have Unicode’s ability to represent compositions in more than one equivalent way combining with an 8-bit clean system to punish the user. If the application had been a GUI, with a file selection combo box, then chances are that I wouldn’t notice any problems as long as I clicked on the file I wanted, but let me type its name and things break.

Most operating systems out there, Windows and Solaris included, just-don’t-normalize. Because typical input methods produce pre-composed codepoints, no one notices any problems. But Mac OS X does normalize: it normalizes filenames given as inputs to LOOKUP and CREATE operations, and it normalizes to a form (NFD) that is different from the one that typical input methods on other operating systems produce.

So, what to do?

We could take the Mac OS X approach: normalize on LOOKUP and CREATE, possibly to NFC instead of NFD (to better match display capabilities on Solaris renderers).

Or we could choose to be normalization-insensitive on LOOKUP and normalization-preserving on CREATE.

The latter interops best, but is also more expensive. It’s also more correct — we don’t have to worry about applications that do silly things like CREATE then READDIR and look for the thing created. Fortunately we can fast-path processing of ASCII names.
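Here is a toy model of the two approaches, with a directory reduced to an in-memory table. The class names and the `norm_key` helper are invented for this sketch, and NFC is an arbitrary choice of comparison form; it is meant only to show the CREATE-then-READDIR difference and the ASCII fast path, not how a real filesystem would store entries.

```python
import unicodedata


def norm_key(name: str) -> str:
    """Comparison key for a name; ASCII names need no normalization."""
    if name.isascii():
        return name  # fast path
    return unicodedata.normalize("NFC", name)


class NormalizingDir:
    """Mac OS X style: normalize on CREATE and on LOOKUP."""

    def __init__(self):
        self._names = set()

    def create(self, name: str) -> None:
        self._names.add(norm_key(name))  # the caller's spelling is lost

    def lookup(self, name: str) -> bool:
        return norm_key(name) in self._names

    def readdir(self) -> list:
        return sorted(self._names)


class InsensitiveDir:
    """Normalization-insensitive LOOKUP, normalization-preserving CREATE."""

    def __init__(self):
        self._names = {}  # comparison key -> name exactly as created

    def create(self, name: str) -> None:
        self._names.setdefault(norm_key(name), name)

    def lookup(self, name: str) -> bool:
        return norm_key(name) in self._names

    def readdir(self) -> list:
        return sorted(self._names.values())


decomposed = "a\u0301"   # the name as an application created it
composed = "\u00e1"      # the name as a user later types it

for directory in (NormalizingDir(), InsensitiveDir()):
    directory.create(decomposed)
    print(type(directory).__name__,
          "lookup ok:", directory.lookup(composed),
          "readdir returns created name:", decomposed in directory.readdir())
```

Both directories find the file whichever way the name is typed, but only the second returns from READDIR the exact name the application passed to CREATE.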

Then again, normalization-insensitiveness has some complications, as it’s not enough to have primitives for comparing strings without regard to composition. There are places where the system hashes strings, such as in the DNLC (directory name lookup cache), and we may not want to be normalizing entire strings there as that would involve memory allocation. So we might need a primitive to normalize strings in small incremental steps, so hash functions can normalize their string inputs without having to allocate memory.
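One way such a primitive might look, sketched in Python: walk the name, cut it into short runs that can each be normalized independently, and feed each normalized run to the hash as it is produced. Several things here are assumptions made for the example: the target form is NFD (chosen because its safe cut points are easy to state), the hash is 64-bit FNV-1a, and the function names are invented. Python itself allocates freely, of course; the point is only that no more than one short run is held at a time.

```python
import unicodedata

FNV_OFFSET = 0xCBF29CE484222325   # 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3         # 64-bit FNV prime


def _starts_run(ch: str) -> bool:
    # Safe to cut before `ch` if its decomposition begins with a starter
    # (combining class 0): canonical reordering never crosses a starter.
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0]) == 0


def nfd_runs(name: str):
    """Yield the NFD form of `name` one short run at a time."""
    run = ""
    for ch in name:
        if run and _starts_run(ch):
            yield unicodedata.normalize("NFD", run)
            run = ""
        run += ch
    if run:
        yield unicodedata.normalize("NFD", run)


def normalized_hash(name: str) -> int:
    """FNV-1a over the UTF-8 bytes of NFD(name), computed incrementally."""
    h = FNV_OFFSET
    for run in nfd_runs(name):
        for byte in run.encode("utf-8"):
            h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


# Composed and decomposed spellings land in the same hash bucket.
print(normalized_hash("\u00e1.txt") == normalized_hash("a\u0301.txt"))
```

A run is a base character plus whatever combining marks follow it, so in practice the working buffer stays small, though a pathological name with a very long sequence of combining marks would still need a bound.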

In closing I should point out that these two approaches to dealing with normalization both assume that strings that hit the filesystem are already in UTF-8; that is, to address normalization we must first establish an I18N convention as described above.

~ by nico on December 15, 2006.

2 Responses to “Filesystem I18N”

  1. You have a typo (html syntax error) that hides text in the second paragraph. Good article

  2. Thanks for letting me know about the typo! Glad you liked the article.
