Friday, 22 October 2010

Of Characters and Strings (in .Net, C#, Silverlight …): Part 2

"Your manuscript I’ve read my friend
And like the half you’ve pilfered best;
Be sure the piece you yet may mend –
Take courage man and steal the rest."

Anon (c. 1805)

As mentioned in Of Characters and Strings (in .Net, C#, Silverlight …): Part 1 many of the text processing features of .Net are such that code can be re-used over Silverlight, WPF, XNA, Mono, Moonlight etc. In this note I want to draw attention to a few key points and differences over platforms. It is far from a comprehensive analysis and I’ve skipped some important topics including collation, the variety of ways of displaying text ,and input methods.

A quick aside: I failed to highlight in Part 1 how 16-bit confusion in software is not limited to .Net, similar issues arise in Java (although Java has some helpful UTF-32 methods that .Net lacks) and the situation is far worse in the world of C/C++.

I’ll restrict attention to Silverlight 4, Silverlight 3 for Windows Phone 7 OS (WPOS7.0), and .Net 4. In case the significance of the opening quote passes the reader by, it is to be hoped that Silverlight 5 and any new branches on the .Net tree feature increased compatibility, deprecating certain .Net 4 elements where appropriate.

.Net Text in general

Many Silverlight (and WPF) programmers rely mainly on controls and services written by others to deal with text so are unfamiliar with the lower level implications – in this case the main thing is to make sure you test input, processing, and display components with text that contains non-BMP characters, combining character sequences, and a variety of scripts and fonts. You may also like to check out different culture settings. Relying on typing querty123 on a UK/US keyboard and using the default font is a sure-fire recipe for bugs in all but the simplest scenarios.

.Net text is oriented around UTF-16 so you will be using Unicode all or almost all the time. The dev. tools, e.g. Visual Studio 2010 and Expression Blend 4, work pretty well with Unicode and cope with the basics of non-BMP characters. I find the main nuisance with the VS 2010 code editor is that only one font is used at a given time, I can’t currently combine the attractive Consolas font with fall-back to other fonts when a character is undefined in Consolas.

For portability over .Net implementations, the best bet at present is to write the first cut in Silverlight 3 since for this topic SL3 is not far off being a subset of .Net 4. A good design will separate the text functionality support from the platform-specific elements, e.g. it’s not usually a good idea to mix text processing implementation in with the VM in a MVVM design or in the guts of a custom control. I often factor out my text processing into an assembly containing only portable code.

Text display on the various platforms is a large topic in its own right with a number of issues for portability once you go beyond using standard controls in SL, WPF etc. I’ll not go into this here. Likewise with text input methods.

Versions of Unicode

It would be really useful if the various flavours and versions of .Net were assertive on text functionality, e.g. Supports Unicode 6.0 (the latest version) or whatever. By support, I mean the relevant Unicode data tables are used (e.g. scripts, character info) not to imply that all or even most functionality is available for all related languages, scripts and cultures. Currently, the reality is much more fuzzy and there are issues derived from factors such as ISO/Unicode scripts missing in OpenType. Methods like Char.IsLetter can be useful in constructing a 'version detector' to figure out what is actually available at runtime.

Char structure

The .Net Char structure is a UTF-16 code unit, not a Unicode character. Current MSDN documentation still needs further updates by Microsoft to avoid this confusion (as noted in Part 1). Meanwhile the distinction should be kept in mind by developers using Char. For instance although the method Char.IsLetter(Char) returns true if the code unit is question is a letter-type character in the BMP, IsLetter(Char) is not in general a function you would use when looking for letters in a UTF-16 array.

It is therefore often a good idea to use strings or ‘string literal’ to represent characters, in preference to Char and ‘character literal’. Inexperienced programmers or those coming from a C/C++ background may find this odd to begin with, being familiar with patterns like

for (i=0; i < chars.Length; i++) { if (chars[i]=='A' { etc. }};

Fortunately, Char provides methods to work correctly with characters, for instance Char.InLetter(String, Int32). Char.IsLetter("A",0) returns true and the function works equally well for Char.IsLetter("\u10900",0) so long as your platform supports Unicode 5.1 (the version of Unicode that introduced U+10900 𐤀 PHOENICIAN LETTER ALF).

Char is largely portable apart from a few quirks. I find it especially puzzling that Char.ConvertFromUtf32 and Char.ConvertToUtf32 are missing from SL 3/4. Faced with this I wrote my own conversions and use these even on .Net 4.0 where the methods are available in Char, this way keeping my code portable and efficient.

I also wrote a CharU32 structure for use in UTF-32 processing (along with a StringU32) rather than extend Char to contain methods like IsLetter(int) for UTF-32 (as is done in Java). Makes for cleaner algorithms and something Microsoft may want to consider for .Net futures.

The CharUnicodeInfo class in SL 3/4 is a subset, missing GetDecimalDigit, a rather specialist helper for working with decimal numbers in non-Latin scripts and in most cases one would use GetNumericValue which is portable.

CharUnicodeInfo.GetUnicodeCategory and Char.GetUnicodeCategory have subtle differences.

String class

The .Net String is a sequential collection of Char objects whose value is immutable (i.e. its value is never modified but can be replaced. Informally, String encapsulates a read-only UTF-16 array. As with Char, the MSDN documentation tends to confuse Unicode character with UTF-16 code unit.

Comments above on use of string and string literal for Char also apply here. For instance the method IndexOf(Char) can only be used to find a Char, i.e. code unit. IndexOf(String) must be used to find an arbitrary Unicode character. If you try entering '\u10900' in C# in Visual Studio 2010, you will be warned “Too many characters in character literal” a reminder .NET character literals are not quite characters and "\u10900" is needed.

Much of String is portable. I’ll just mention a few differences:

Net 4.0 has extra String methods like StartsWith(String, Boolean, CultureInfo), a method which gives more flexibility in working with multiple cultures. SL 3/4 is oriented around current culture and invariant culture so not well suited to multilingual applications.

The whole model of culture and locale in software is way too complex to go into here, I’ll just say that the old fallacies like physical presence in Iceland means a user can speak Icelandic and never spends US dollars or that individuals use only one language have no place in the 21st Century.

SL 3/4 is missing the Normalize method, a very handy function when working with Unicode.

SL 3/4 has fewer constructors available to the programmer than Net 4 but some of those missing are error prone and wouldn’t normally be used in .Net 4.

StringBuilder class

Useful, gives measureable performance improvements constructing a lot of strings.

Very portable, just a few differences, I’ve not encountered these in real life code yet. Rather odd that Append(Decimal) is missing from SL 3/4 though.

StringInfo class

StringInfo is designed to work with actual Unicode characters so is very useful. I get the impression it is less well known among programmers than it ought to be.

Largely portable. However SL 3/4 is missing the SubstringByTextElements methods and LengthInTextElements both of which are handy for processing Unicode correctly in readable code. I’ve found myself simulating something similar to gain portability, although that being said, the portable enumeration methods make for a more common pattern.

Encoding classes

Encoding classes are often used while reading or writing text streams or converting from one text format to another, usually with a class derived from Encoding: SL 3/4 only provides the UnicodeEncoding (UTF-16) and UTF8Encoding classes, in contrast to .Net 4 which also has ASCIIEncoding, UTF7Encoding, and UTF32Encoding.

ASCII and UTF-7 are understandable, omission of UTF32Encoding is hard to see as anything other than an oversight given that UTF-32 is a useful weapon in the armoury of Unicode programming (although hardly ever a good idea for file streams). One of the first things I did in the early days of Silverlight 2 beta was to write conversion code. I hope Microsoft will add this to Silverlight 5, although for portability reasons it may be years before we can settle down with it.

The base Encoder class in .Net 4 has quite a number of extra methods and properties for Mail and Browser purposes as well as codepage handling. I tend to agree with the decision in SL to drop some of this complexity and discourage use of legacy stream formats even if that means occasionally custom code is required to convert ‘Windows ANSI’, ‘Mac OS Roman’ and the ISO/IEC 8859 series of 8 bit encodings.

The main features of Utf8Encoding such as BOM handling and exceptions are portable. Unicode normalization features are missing from SL 3/4 again, as with the String class.

Not the most portable part of .Net.

No comments:

Post a Comment