Tuesday, March 09, 2010

Unicode

UTF-16 is a way of representing all of the UCS code points in two bytes (or four). It can encode every code point from the Basic Multilingual Plane (BMP) in just two bytes, but code points in the other planes are encoded as surrogate pairs: two two-byte code units. UCS-2, as used by SQL Server for all Unicode text, was the precursor to UTF-16 and can only handle code points from the BMP. It is forward compatible with UTF-16, but any code point outside the BMP encoded in UTF-16 will appear to be two separate code points *inside* the BMP (both in the surrogate range U+D800 to U+DFFF) if the data is interpreted as UCS-2. The data is preserved - it is only the semantics (i.e. the abstract "character(s)") that differ.
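To make the surrogate mechanics concrete, here is a minimal C# sketch (my own illustration, not anything SQL Server does; the choice of U+10900, PHOENICIAN LETTER ALF, is arbitrary) showing one supplementary-plane code point becoming two 16-bit code units:

```csharp
using System;
using System.Text;

class Utf16Demo
{
    static void Main()
    {
        // U+10900 (PHOENICIAN LETTER ALF) lies outside the BMP, so
        // UTF-16 must encode it as the surrogate pair 0xD802 0xDD00.
        string alf = char.ConvertFromUtf32(0x10900);

        // Two 16-bit code units (four bytes) for one code point.
        Console.WriteLine(alf.Length);                                // 2
        Console.WriteLine("{0:X4} {1:X4}", (int)alf[0], (int)alf[1]); // D802 DD00

        // A UCS-2 consumer sees the same bytes as two BMP code points;
        // the data is preserved, only the interpretation differs.
        byte[] utf16 = Encoding.Unicode.GetBytes(alf); // UTF-16LE
        Console.WriteLine(BitConverter.ToString(utf16)); // 02-D8-00-DD
    }
}
```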

Note: Encountering code points outside the BMP is extremely unlikely in modern business applications, as the scripts outside the BMP are academic and/or historical, such as Phoenician. For these purposes, UCS-2 and UTF-16 can be considered equivalent and interchangeable.

Side note: UTF-8 is just another way of representing all of the UCS code points from all of the planes; it is a variable-width encoding (one to four bytes per code point) whose bit layout differs from UTF-16's.
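For comparison, a short sketch (reusing the same illustrative code point as above) showing the UTF-8 byte sequence for U+10900; it also takes four bytes here, but with an entirely different bit layout from the UTF-16 form:

```csharp
using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        string alf = char.ConvertFromUtf32(0x10900); // U+10900 again

        // UTF-8 encodes this code point as a four-byte sequence.
        byte[] utf8 = Encoding.UTF8.GetBytes(alf);
        Console.WriteLine(BitConverter.ToString(utf8)); // F0-90-A4-80
    }
}
```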

Extra note: .NET uses UTF-16 as the in-memory encoding of the System.String type. A .NET System.Char is limited to 16 bits, so a single char cannot hold a UTF-16 surrogate pair (i.e. any code point not on the BMP). The char data type is similar to UCS-2 in this respect. Getting the char at a specific index of a string always returns a single 16-bit char, splitting a surrogate pair if one straddles that index.
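A small sketch of that indexing behaviour (illustrative only; char.ConvertToUtf32 and StringInfo are the standard .NET ways to recover whole code points from a string):

```csharp
using System;
using System.Globalization;

class CharIndexDemo
{
    static void Main()
    {
        string s = "A" + char.ConvertFromUtf32(0x10900); // "A" + U+10900

        // Length counts 16-bit chars, not code points: 1 + 2 = 3.
        Console.WriteLine(s.Length); // 3

        // Indexing returns raw UTF-16 code units, splitting the pair.
        Console.WriteLine(char.IsHighSurrogate(s[1])); // True
        Console.WriteLine(char.IsLowSurrogate(s[2]));  // True

        // Recombine the pair explicitly to get the code point back...
        Console.WriteLine("U+{0:X}", char.ConvertToUtf32(s, 1)); // U+10900

        // ...or iterate by text element rather than by char.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            Console.WriteLine((string)e.Current); // "A", then the full pair
    }
}
```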
