Tuesday, 19 October 2010

Of Characters and Strings (in .Net, C#, Silverlight …): Part 1

“The time has come,” the Walrus said,
“To talk of many things:
Of shoes—and ships—and sealing-wax—
Of characters—and strings—
And why the sea# is boiling hot—
And whether pigs have wings.”

(With apologies to Lewis Carroll, the Walrus, and the Carpenter).

During discussion of my comments on ISO/Unicode scripts missing in OpenType on the Unicode mailing list, the point came up that a greater understanding of Unicode among programmers and others involved with software development would be desirable. For a start, there is one popular myth to dispel. That is the subject of this post, which I hope will be the first of several notes on Unicode in .Net.

Myth debunk: a Unicode character is neither a cabbage nor a 16 bit code.

The origin of the 16-bit confusion lies in the history of Unicode. Twenty years ago there were two initiatives underway to replace the already outdated and problematic variety of 7/8-bit character encodings used to represent characters in modern scripts. A true Babel of ‘standard’ encodings back then made it impractical to write software to work with the world’s writing systems without a tremendous level of complexity. Unicode was originally conceived as a 16-bit coding to replace this mess. Meanwhile, the International Organization for Standardization (ISO) was working on ISO 10646, the ‘Universal Character Set’ (UCS), with space for many more characters than a 16-bit encoding has room for. The original ISO proposals for encoding were widely regarded as over-complex, so the ISO and Unicode approaches were merged by the time Unicode 2.0 was released in 1996. ISO 10646 now defines the Universal Character Set for Unicode. With unification, the notion of 16-bit characters became obsolete, although a 16-bit encoding method remains (UTF-16) along with the popular 8-bit coding (UTF-8) and a 32-bit coding (UTF-32). UTF stands for Unicode Transformation Format. Each encoding has its virtues.

To understand what constitutes the Unicode notion of ‘character’, refer to http://www.unicode.org/versions/Unicode6.0.0/ (or the earlier version while the text of 6.0 is being completed). I will try to summarize briefly.

1. An abstract character is a unit of information for representation, control or organization of textual data. A Unicode abstract character is an abstract character encoded by the Unicode standard. Abstract characters not directly encoded in Unicode may well be capable of being represented by a Unicode combining character sequence. Each Unicode abstract character is assigned a unique name. Some combining sequences are also given names in Unicode, asserting their function as abstract characters.
2. A Unicode encoded character can be informally thought of as an abstract character along with its assigned Unicode code point (an integer in the range 0 to 10FFFF hexadecimal, the Unicode codespace). As noted above it is also assigned a unique name.
3. A Unicode character or simply character is normally used as shorthand for the term Unicode encoded character.

Here are two useful ways of describing Unicode characters:

U+006D LATIN SMALL LETTER M
U+13000 EGYPTIAN HIEROGLYPH A001
U+1F61C FACE WITH STUCK-OUT TONGUE AND WINKING EYE

And similarly, with the actual character displayed:

U+006D – m – LATIN SMALL LETTER M
U+13000 – 𓀀 – EGYPTIAN HIEROGLYPH A001
U+1F61C – 😜 – FACE WITH STUCK-OUT TONGUE AND WINKING EYE

The first form is often preferable in scenarios where font support might not be present to display the actual character although on this blog I prefer to use the characters to encourage font diversity.

Note the conventional use of hexadecimal to state the value of the Unicode code point. This convention differs from the one most often seen in HTML, where numeric character references are typically written in decimal, e.g. &#77824; (13000 hexadecimal equals 77824 decimal), although HTML also accepts the hexadecimal form &#x13000;.
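The two notations can be produced from the same integer. The following is a minimal sketch, not taken from any library: the class and method names (CodePointNotation, ToUPlus, ToDecimalReference, ToHexReference) are made up for illustration. HTML accepts both the decimal form and the hexadecimal form with an "x" prefix.

```csharp
using System;

static class CodePointNotation
{
    // Conventional U+XXXX form: hexadecimal, padded to at least four digits.
    public static string ToUPlus(int codePoint)
    {
        return "U+" + codePoint.ToString("X4");
    }

    // Decimal HTML numeric character reference, e.g. &#77824;
    public static string ToDecimalReference(int codePoint)
    {
        return "&#" + codePoint.ToString() + ";";
    }

    // Hexadecimal HTML numeric character reference, e.g. &#x13000;
    public static string ToHexReference(int codePoint)
    {
        return "&#x" + codePoint.ToString("X") + ";";
    }

    static void Main()
    {
        int codePoint = 0x13000; // EGYPTIAN HIEROGLYPH A001
        Console.WriteLine(ToUPlus(codePoint));            // U+13000
        Console.WriteLine(ToDecimalReference(codePoint)); // &#77824;
        Console.WriteLine(ToHexReference(codePoint));     // &#x13000;
        Console.WriteLine(ToUPlus(0x6D));                 // U+006D
    }
}
```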

From a programming perspective, the simplest way of representing Unicode is UTF-32, where each code point fits comfortably into a 32-bit data type, e.g. a uint or int in C# (C/C++ programmers note that C# defines int as 32 bits; the size does not vary with CPU register size). Even this is not entirely trivial, because a single abstract character may still be represented by a combining sequence of several code points. However, UTF-32 is not used all that much in practice, not least because of its memory cost.
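To make the three levels concrete, here is a hedged sketch that decodes a .Net string into one int per code point and compares the counts. The helper name ToCodePoints is hypothetical; char.ConvertToUtf32, char.IsSurrogatePair and StringInfo.LengthInTextElements are the framework calls it relies on. It assumes well-formed input, since char.ConvertToUtf32 throws on an unpaired surrogate.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CodePoints
{
    // Decodes a UTF-16 string into one 32-bit int per code point.
    // Assumes well-formed text: an unpaired surrogate raises ArgumentException.
    public static int[] ToCodePoints(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            result.Add(char.ConvertToUtf32(text, i));
            if (char.IsSurrogatePair(text, i)) i++; // step over the low surrogate
        }
        return result.ToArray();
    }

    static void Main()
    {
        // 'e' + U+0301 COMBINING ACUTE ACCENT, followed by U+13000.
        string text = "e\u0301" + char.ConvertFromUtf32(0x13000);
        int[] codePoints = ToCodePoints(text);

        Console.WriteLine("UTF-16 code units: {0}", text.Length);       // 4
        Console.WriteLine("Code points:       {0}", codePoints.Length); // 3
        Console.WriteLine("Text elements:     {0}",
            new StringInfo(text).LengthInTextElements);                 // 2
    }
}
```

Even with one int per code point, the accented letter still occupies two array slots, which is the combining-sequence wrinkle mentioned above.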

Nowadays, most files containing Unicode text use the UTF-8 encoding. UTF-8 uses 1 byte (octet) to encode each of the 128 traditional ASCII characters and up to 4 bytes to encode other characters. XML and HTML are popular file formats that use Unicode (mandatory in XML, optional in HTML, where a surprising amount of the web, possibly 50%, still uses legacy encodings). I strongly recommend UTF-8 for text files rather than UTF-16 or legacy 8-bit encodings (aka code pages etc.). Having worked on several multilingual content-intensive projects, I regard this as the golden rule, although I won’t expand further today on the whys and wherefores. [However, I ought to mention the catch that is the ‘byte order mark’ (BOM), a byte sequence (0xEF, 0xBB, 0xBF) sometimes used at the start of a UTF-8 stream to assert that it is UTF-8 and not legacy text; this can confuse the novice, particularly with ‘.txt’ files, which can be Unicode or legacy. Windows Notepad uses a BOM for Unicode text files. Visual Studio 2010 also uses a BOM to prefix data in many file types including XML, XAML and C# code.]
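A small sketch of both points, the variable byte length and the BOM, using System.Text.Encoding. The helper names Utf8Length and StartsWithUtf8Bom are illustrative only; Encoding.UTF8.GetByteCount and Encoding.UTF8.GetPreamble are standard framework calls.

```csharp
using System;
using System.Text;

static class Utf8Demo
{
    // Number of bytes UTF-8 needs for the given text (BOM not included).
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // True if the buffer starts with the UTF-8 BOM bytes EF BB BF.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(Utf8Length("m"));                            // 1 (ASCII)
        Console.WriteLine(Utf8Length("\u00E9"));                       // 2 (U+00E9)
        Console.WriteLine(Utf8Length("\u20AC"));                       // 3 (U+20AC)
        Console.WriteLine(Utf8Length(char.ConvertFromUtf32(0x1F61C))); // 4 (non-BMP)

        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x6D };
        byte[] withoutBom = { 0x6D };
        Console.WriteLine(StartsWithUtf8Bom(withBom));    // True
        Console.WriteLine(StartsWithUtf8Bom(withoutBom)); // False
    }
}
```

When writing files, passing new UTF8Encoding(false) to a writer is one way to produce UTF-8 without the BOM, should a consumer choke on it.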

UTF-16 is very popular with software writers working in C/C++ and .Net languages such as C#. A version of UTF-16 was the standard data format for Unicode 1.0. Unicode characters with code points less than 0x10000 are said to belong to the Unicode BMP (Basic Multilingual Plane), and these are represented by one 16-bit number in UTF-16. Other characters require two 16-bit numbers, i.e. two UTF-16 code units taken from a range that does not encode characters: the so-called surrogate code points, which are dedicated to this purpose. As of Unicode 6.0, fewer than 50% of characters belong to the BMP, but BMP characters account for a huge proportion of text in practice. This is by design; all popular modern languages have most script/writing system requirements addressed by the BMP, and there are even specialist scripts such as Coptic defined here. Processing UTF-16 is often more efficient than UTF-8 and in most cases it uses half the memory of UTF-32, so all in all it is a good practical compromise.
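The surrogate mechanism is simple arithmetic: subtract 0x10000 to get a 20-bit value, then put the top 10 bits into a high surrogate (starting at 0xD800) and the bottom 10 bits into a low surrogate (starting at 0xDC00). The sketch below spells that out; ToSurrogatePair is a hypothetical helper written for illustration, and in real code char.ConvertFromUtf32 does the same job.

```csharp
using System;

static class Surrogates
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;             // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));  // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new[] { high, low };
    }

    static void Main()
    {
        char[] pair = ToSurrogatePair(0x1F61C);
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]); // D83D DE1C

        // Cross-check against the framework's own conversion.
        string viaFramework = char.ConvertFromUtf32(0x1F61C);
        Console.WriteLine(viaFramework[0] == pair[0] && viaFramework[1] == pair[1]); // True
    }
}
```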

Which brings me back to the 16-bit myth. The fact that so many popular characters belong to the BMP and only require one code unit in UTF-16 means it is easy to be misled into thinking that most means all. The problem doesn’t even arise with UTF-8 and UTF-32, but the fact is that much software uses UTF-16; indeed, UTF-16 is essentially the native text encoding for Windows and .Net.

Example sources of 16-bit confusion:

The article on character sets at http://www.microsoft.com/typography/unicode/cs.htm is brazen:



This article is dated 1997 but was probably written much earlier. Windows NT 3.1 (1993) was notable as the first computer operating system to use Unicode as its native text encoding and Microsoft deserves credit for this, alongside Apple, who also did much to help early uptake of Unicode (but would not have a new operating system until OS X was released in 2001). I’m quoting this as an example of the fact that there are many old documents on the Web that are confusing even when they come from reputable sources. I should mention, in contrast, that much of MSDN (and indeed much of the relevant information on Wikipedia) is pretty up to date and reliable, although not perfect on this subject.

The definition of the .Net Char structure on MSDN, http://msdn.microsoft.com/en-us/library/system.char.aspx, is much more recent.



Er, no. Char is not a Unicode character. It is a 16-bit Unicode code unit in UTF-16. This is actually explained later on in the Char documentation, but the headline message is confusing and encourages programmers to use Char inappropriately.
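The distinction is easy to see in a few lines. In this sketch (CountCodePoints is a made-up helper, not a framework method), a single non-BMP character occupies two Char values, and neither of them is a character on its own. An unpaired surrogate, if present, is simply counted as one.

```csharp
using System;

static class CharIsCodeUnit
{
    // Counts code points: every code unit counts once, except that a
    // valid surrogate pair counts as a single code point.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            count++;
            if (char.IsSurrogatePair(text, i)) i++; // step over the low surrogate
        }
        return count;
    }

    static void Main()
    {
        string face = char.ConvertFromUtf32(0x1F61C); // one character, U+1F61C

        Console.WriteLine(face.Length);                   // 2
        Console.WriteLine(char.IsHighSurrogate(face[0])); // True
        Console.WriteLine(char.IsLowSurrogate(face[1]));  // True
        Console.WriteLine(CountCodePoints(face));         // 1
        Console.WriteLine(CountCodePoints("m" + face));   // 2
    }
}
```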

The reasons I chose the Microsoft examples rather than the myriad of other confusing statements on the web are twofold. Firstly I'm focussing on .Net, C# etc. here. Secondly, Microsoft are generally ahead of the game with Unicode compared with other development systems which makes errors stand out more.

Fact is .Net actually works very well for software development with Unicode. The basic classes such as 'String' are Unicode (String is UTF-16) and it is almost true to say it is harder to write legacy than modern.

I had hoped to get a little further on the actual technicalities of working with Unicode characters and avoiding 16-bit pitfalls but time has proved the enemy. Another day.

Just three useful (I hope) points on .Net to conclude.

1. Code that works with String and Char should avoid BMP-thinking, e.g. if you want to parse a String, either avoid tests like IsLetter(Char) or wrap their usage in logic that also handles surrogates.

2. String, Char and the useful StringInfo class belong to the System namespaces and are pretty portable over the gamut of .Net contexts including Silverlight, WPF, XNA as well as the Novell parallel universe with Mono, MonoTouch, Moonlight etc. With a little care it can be straightforward to write text processing code that works across the board to target Windows, Mac, Linux, WP7 and whatever comes next.

3. Always test text-related code with strings that include non-BMP characters, and preferably also with data that includes combining sequences and usage instances of OpenType features such as ligatures.
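As a rough illustration of points 1 and 3, the sketch below counts letters twice: once with the per-Char test and once with the (string, index) overload of char.IsLetter, which reads a whole surrogate pair. The class and method names are hypothetical, and the test string is my own choice: it mixes a BMP letter, a combining sequence and U+1D400 MATHEMATICAL BOLD CAPITAL A, a letter outside the BMP.

```csharp
using System;

static class LetterScan
{
    // BMP-thinking: tests each 16-bit code unit on its own, so the two
    // halves of a surrogate pair are never recognised as a letter.
    public static int CountLettersNaive(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.IsLetter(c)) count++;
        }
        return count;
    }

    // Surrogate-aware: the (string, index) overload examines a whole
    // surrogate pair; the loop then steps over the low surrogate.
    public static int CountLetters(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsLetter(text, i)) count++;
            if (char.IsSurrogatePair(text, i)) i++;
        }
        return count;
    }

    static void Main()
    {
        // 'm', then 'e' + U+0301 (a combining mark, not itself a letter),
        // then U+1D400, which needs a surrogate pair in UTF-16.
        string text = "me\u0301" + char.ConvertFromUtf32(0x1D400);

        Console.WriteLine(CountLettersNaive(text)); // 2
        Console.WriteLine(CountLetters(text));      // 3
    }
}
```

A test string containing only BMP characters would give the same answer from both methods, which is exactly why such strings make poor test data.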
