On Unicode Normalization, or why normalization insensitivity should be the rule
Do you know what Unicode normalization is? If you have to deal with Unicode, then you should know. Otherwise this blog post is not for you. Target audience: a) Internet protocol authors, reviewers, IETF WG chairs, the IESG, b) developers in general, particularly any working on filesystems, networked applications or text processing applications.
Short story: Unicode allows various characters to be written either as a single “pre-composed” codepoint or as a sequence of one base character codepoint plus one or more combining codepoints. Think of ‘á’: it could be written as a single codepoint that corresponds to the ISO-8859-1 ‘á’ or as two codepoints, one being plain old ASCII ‘a’, and the other being the combining codepoint that says “add acute accent mark”. There are characters that can have five or more different representations in Unicode.
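To make this concrete, here is a minimal sketch using the `unicodedata` module from Python's standard library. The variable names are only illustrative, and the three-spelling character at the end is my own choice of example, not one taken from the text above.

```python
import unicodedata

precomposed = "\u00e1"   # 'á' as one codepoint: LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # ASCII 'a' followed by COMBINING ACUTE ACCENT

# Identical to a reader, but different codepoint sequences to a naive comparison.
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# NFC composes and NFD decomposes; once normalized, the two spellings agree.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# A character with three canonically equivalent spellings:
# 'a' with ring above and acute (U+01FB).
spellings = ["\u01fb", "\u00e5\u0301", "a\u030a\u0301"]
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
assert len(nfc_forms) == 1
```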
The long version of the story is too long to go into here. If you find yourself thinking that Unicode is insane, then you need to acquaint yourself with that long story. There are good reasons why Unicode has multiple ways to represent certain characters; wishing it weren’t so won’t do.
Summary: Unicode normalization creates problems.
So far the approach most often taken in Internet standards to deal with Unicode normalization issues has been to pick a normalization form and then say you “MUST” normalize text to that form. This rarely gets implemented because the burden is too high. Let’s call this the “normalize-always” (‘n-a’, for short) model of normalization. Specifically, in the context of Internet / application protocols, the normalize-always model requires normalizing when: a) preparing query strings (typically on clients), b) creating storage strings (typically on servers). The normalize-always model typically results in all implementors having to implement Unicode normalization, regardless of whether they implement clients or servers.
Examples of protocols/specifications using n-a: stringprep; IMAP/LDAP/XMPP/… via SASL via SASLprep; Nameprep/IDNA (internationalized domain names); Net Unicode; e-mail headers; and many others.
I want to promote a better alternative to the normalize-always model: the normalization-insensitive / normalization-preserving (or ‘n-i/n-p’, for short) model.
In the n-i/n-p model you normalize only when you absolutely have to for interoperability:
- when comparing Unicode strings (e.g., query strings to storage strings);
- when creating hash/b-tree/other-index keys from Unicode strings (hash/index lookups are like string comparisons);
- when you need canonical inputs to cryptographic signature/MAC generation/validation functions.
That’s far fewer times and places where one needs to normalize strings than in the n-a model. Moreover, in the context of many/most protocols normalization can be left entirely to servers rather than clients, and simpler clients lead to better adoption rates. Easier adoption alone should be a sufficient advantage for the n-i/n-p model.
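The three normalization points listed above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (`ni_equal`, `ni_key`, `ni_mac`) are hypothetical, the choice of NFC is arbitrary (any single form works if used consistently), and HMAC-SHA-256 stands in for whatever signature or MAC function a given protocol actually uses.

```python
import hashlib
import hmac
import unicodedata

FORM = "NFC"  # assumption: any one form, as long as every caller uses the same one


def ni_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring differences in normalization form."""
    return unicodedata.normalize(FORM, a) == unicodedata.normalize(FORM, b)


def ni_key(s: str) -> bytes:
    """Derive a hash/index key; the original string is stored unmodified elsewhere."""
    return unicodedata.normalize(FORM, s).encode("utf-8")


def ni_mac(secret: bytes, s: str) -> str:
    """Feed a canonical form to the MAC so both sides compute the same tag."""
    return hmac.new(secret, ni_key(s), hashlib.sha256).hexdigest()
```

Note that none of these functions changes the string that gets stored or sent; normalization happens only inside the comparison, the key derivation, and the MAC input.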
But it gets better too: the n-i/n-p model also provides better compatibility with and upgrade paths from legacy content. This is because in this model storage strings are not normalized on CREATE operations, which means that you can have Unicode and non-Unicode content co-existing side-by-side (though one should only do that as part of a migration to Unicode, as otherwise users can get confused).
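As a rough illustration of "normalization-preserving" storage with legacy names alongside, here is a toy in-memory directory in Python. Everything about it is an assumption made for the sketch: the class name `NiNpDirectory`, the choice of NFD for index keys, and the policy of indexing non-UTF-8 names by their raw bytes. It is not how ZFS or any real filesystem is implemented.

```python
import unicodedata


class NiNpDirectory:
    """Toy directory: normalization-insensitive lookup, normalization-preserving storage."""

    def __init__(self):
        self._entries = {}  # index key -> (name bytes exactly as created, value)

    @staticmethod
    def _key(name: bytes) -> bytes:
        try:
            text = name.decode("utf-8")
        except UnicodeDecodeError:
            return name  # legacy, non-UTF-8 name: index the raw bytes as-is
        return unicodedata.normalize("NFD", text).encode("utf-8")

    def create(self, name: bytes, value) -> None:
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = (name, value)  # the stored name is NOT normalized

    def lookup(self, name: bytes):
        return self._entries[self._key(name)][1]

    def readdir(self):
        return [stored for stored, _ in self._entries.values()]


d = NiNpDirectory()
d.create("caf\u00e9".encode("utf-8"), "created with precomposed e-acute")
d.create("caf\u00e9".encode("latin-1"), "legacy ISO-8859-1 name")
found = d.lookup("cafe\u0301".encode("utf-8"))  # decomposed spelling still matches
```

The point of the design is that `readdir()` hands back exactly the bytes each client created, while `lookup()` succeeds for any canonically equivalent spelling.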
The key to n-i/n-p is: fast n-i string comparison functions, as well as fast byte- or codepoint-at-a-time string normalization functions. By “fast” I mean that any time that two ASCII codepoints appear in sequence you have a fast path and can proceed to the next pair of codepoints starting with the second ASCII codepoint of the first pair. For all- or mostly-ASCII Unicode strings this fast path is not much slower than a typical for-every-byte loop. (Note that strcmp() optimizations such as loading and comparing 32 or 64 bits at a time apply to both ASCII-only/8-bit-clean and n-i algorithms: you just need to check for any bytes with the high bit set, and whenever you see one you trigger the slow path.) And, crucially, there’s no need for memory allocation when normalization is required in these functions: why build normalized copies of the inputs when all you’re doing is comparing or hashing them?
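The ASCII fast path can be sketched in Python as follows. This is a simplified illustration, not the ZFS code: the names `ni_equal_fast` and `_plain_ascii_at` are hypothetical, and, unlike the allocation-free design described above, the slow path here simply hands the remaining suffixes to `unicodedata.normalize`, which does build normalized copies. A production version would instead normalize a codepoint at a time inside the loop.

```python
import unicodedata


def _plain_ascii_at(s: str, i: int) -> bool:
    """True if s[i] is ASCII and is followed by another ASCII codepoint (or the end).

    Such a codepoint cannot be altered by a following combining mark, so it can be
    compared directly without any normalization.
    """
    return ord(s[i]) < 0x80 and (i + 1 == len(s) or ord(s[i + 1]) < 0x80)


def ni_equal_fast(a: str, b: str) -> bool:
    """Normalization-insensitive equality with a fast path for ASCII runs."""
    i = 0
    n = min(len(a), len(b))
    # Fast path: both strings are sitting on "plain" ASCII codepoints.
    while i < n and _plain_ascii_at(a, i) and _plain_ascii_at(b, i):
        if a[i] != b[i]:
            return False
        i += 1
    # Slow path (simplified): normalize and compare whatever is left.
    return unicodedata.normalize("NFD", a[i:]) == unicodedata.normalize("NFD", b[i:])
```

For two all-ASCII strings of equal length the loop runs to the end and the final comparison is between two empty strings, so the cost stays close to a plain codepoint-by-codepoint compare.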
We’ve implemented normalization-insensitive/preserving behavior in ZFS, controlled by a dataset property. This means that NFS clients on Solaris, Linux, MacOS X, *BSD, and Windows will interop with each other through ZFS-backed NFS servers regardless of what Unicode normalization forms they use, if any, and without the clients having to be modified to normalize.
My proposal: a) update stringprep to allow for profiles that specify n-i/n-p behavior, b) update SASLprep and various other stringprep profiles (but NOT Nameprep, nor IDNA) to specify n-i/n-p behavior, c) update Net Unicode to specify n-i/n-p behavior while still allowing normalization on CREATE as an option, d) update any other protocols that use n-a and which would benefit from using n-i/n-p to use n-i/n-p.
Your reactions? I expect skepticism, but think carefully, and consider ZFS’s solution (n-i/n-p) in the face of competitors that either normalize on CREATE or don’t normalize at all, plus the fact that some operating systems tend to prefer NFC (e.g., Solaris, Windows, Linux, *BSD) while others prefer NFD (e.g., MacOS X). If you’d keep n-a, please explain why.
NOTE to Linus Torvalds (and various Linux developers) w.r.t. this old post on the git list: ZFS does not alter filenames on CREATE nor READDIR operations, ever [UPDATE: Apparently the port of ZFS to MacOS X used to normalize on CREATE to match HFS+ behavior]. ZFS supports case- and normalization-insensitive LOOKUPs, and that’s all (compare to HFS+, which normalizes to NFD on CREATE).
NOTE ALSO that mixing Unicode and non-Unicode strings can cause strange codeset aliasing effects, even in the n-i/n-p model (if there are valid byte sequences in non-Unicode codesets that can be confused with valid UTF-8 byte sequences involving pre-composed and combining codepoints). I’ve not studied this codeset aliasing issue, but I suspect that the chances of such collisions with meaningful filenames is remote, and if the filesystem is set up to reject non-UTF-8 filenames then the chances that users will be able to create non-UTF-8 filenames without realizing that most such names will be rejected is infinitesimally small. This problem is best avoided by disallowing the creation of invalid UTF-8 filenames; ZFS has an option for that.
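To show what such aliasing looks like, and what rejecting invalid UTF-8 amounts to, here is a small Python sketch. The helper name `is_valid_utf8` is hypothetical, and the ISO-8859-1 examples are my own; this says nothing about how the ZFS option is implemented.

```python
def is_valid_utf8(name: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8")  # the default error handler is "strict"
        return True
    except UnicodeDecodeError:
        return False


# Most legacy 8-bit names are not valid UTF-8 and would be rejected outright...
legacy = "caf\u00e9".encode("latin-1")        # b'caf\xe9'
assert not is_valid_utf8(legacy)

# ...but a few alias: these two ISO-8859-1 characters (A-tilde, copyright sign)
# happen to be the exact UTF-8 encoding of a single e-acute.
aliased = "\u00c3\u00a9".encode("latin-1")    # b'\xc3\xa9'
assert is_valid_utf8(aliased)
assert aliased.decode("utf-8") == "\u00e9"
```

A validity check catches the first kind of name but, by construction, cannot catch the second.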
UPDATE: Note also that in some protocols you have to normalize early for cryptographic reasons, such as in Kerberos V5 AS-REQs when not using client name canonicalization, or in TGS-REQs when not using referrals. However, it’s best to design protocols to avoid this.
For an example of real-world Unicode interop problems resulting from partially implemented normalize-always and different Unicode normalization preferences on different OSes see this: https://kerneltrap.org/mailarchive/git/2008/5/4/1719004
Nicolas Williams said this on April 13, 2010 at 14:00
Oh, I just noticed that Linus’ complaint really is about ZFS on MacOS X, which, when it was being ported, apparently had been modified to normalize on CREATE; on Solaris, ZFS does no such thing.
Nicolas Williams said this on April 13, 2010 at 14:04
I am not sure that this is the safe thing to do. How are you going to guarantee that your normalization-ignoring string compare, hash, etc. functions agree about what Unicode string equivalence is?
I also doubt that "I suspect that the chances of such collisions with meaningful filenames is remote" is a strong enough argument for file system code. I wouldn’t know what meaningful filenames in, say, Polish or Azerbaijani would look like.
Reinder said this on April 14, 2010 at 10:52
Unicode versioning is a completely separate issue, and affects n-a as well as n-i/n-p. The thing to do with respect to Unicode versioning is to reject the use of unassigned codepoints in new storage strings (e.g., clients MAY send strings with unassigned codepoints [from the server’s point of view], but servers MUST reject them).
One thing that’s worth noting is that NFC (and NFKC) is closed to new precompositions, which is to say that for all new precomposed characters that might be added in the future, their canonically decomposed form is also their NFC form. But of course, this alone is not sufficient: you must still reject the use of unassigned codepoints.
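A server-side check along these lines might look like the following Python sketch. The function names are hypothetical. "Unassigned" here means general category `Cn` according to the Unicode database bundled with the running Python (see `unicodedata.unidata_version`), which is exactly the "from the server’s point of view" caveat; note that `Cn` also covers noncharacters.

```python
import unicodedata


def has_unassigned(s: str) -> bool:
    """True if any codepoint is unassigned (category Cn) in this Python's Unicode data."""
    return any(unicodedata.category(ch) == "Cn" for ch in s)


def accept_storage_string(s: str) -> bool:
    """Server-side policy sketch: refuse new storage strings with unassigned codepoints."""
    return not has_unassigned(s)
```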
Nico said this on April 14, 2010 at 11:26
Also, with respect to collisions… Applying n-i/n-p to potentially-not-UTF-8 strings is a lot safer than applying n-a to the same. Of course, it’s much better to just reject non-UTF-8 strings, but there’s nothing you can do about the potential that some client out there is sending non-UTF-8 strings that happen to be valid UTF-8 strings. The reason is that we don’t tag strings with codeset information everywhere, or, even, really, _anywhere_.
Sadly, as far as NFSv4 is concerned, most, if not all clients just-send-8. The OpenSolaris NFSv4 server also just-uses-8. ZFS can be configured to reject non-UTF-8, and it can be configured to be n-i/n-p, so that the NFSv4 service in OpenSolaris, with ZFS underneath, behaves as well as you can get it to in the face of clients that just-send-8.
Nico said this on April 14, 2010 at 11:34