Monday 8 November 2010

Simplified Egyptian: Numerals

This is the second of a series of notes on a systematic way of working with Ancient Egyptian Hieroglyphs in Unicode, following up from Simplified Egyptian: A brief Introduction.

What follows makes a lot more sense if you have a hieroglyphic font installed, see my post Egyptian Hieroglyphs on the Web (October 2010).

Ancient Egyptian, in common with other early mathematical systems, had no notion of negative integers or the digit Zero. The Egyptian numeral system is not positional in the modern sense. It is nevertheless straightforward to decode. Examples:

Egyptian 𓎉𓏻 is 42 (𓎉 represents 40, 𓏻 represents 2).

Egyptian 𓆿𓍣𓎉𓏻 is 4,242 (𓆿 represents 4000, 𓍣 represents 200).

Our modern decimal system uses positional notation where the numerals 0, 1 … 9 are used to represent units, tens, hundreds etc. by virtue of position. The Ancient Egyptians used different symbols based on a tally system as should be obvious from the examples. Fortunately, one similarity to modern notation is that the higher magnitude quantities were normally written first (i.e. to the left in Simplified Egyptian, which is always written left to right).

Normalized forms of numerals

The following list gives the preferred representation of hieroglyphs in Unicode for numerals in Simplified Egyptian.

1 to 9: 𓏺, 𓏻, 𓏼, 𓏽, 𓏾, 𓏿, 𓐀, 𓐁, 𓐂.
10 to 90: 𓎆, 𓎇, 𓎈, 𓎉, 𓎊, 𓎋, 𓎌, 𓎍, 𓎎.
100 to 900: 𓍢, 𓍣, 𓍤, 𓍥, 𓍦, 𓍧, 𓍨, 𓍩, 𓍪.
1,000 to 9,000: 𓆼, 𓆽, 𓆾, 𓆿, 𓇀, 𓇁, 𓇂, 𓇃, 𓇄.
10,000 to 90,000: 𓂭, 𓂮, 𓂯, 𓂰, 𓂱, 𓂲, 𓂳, 𓂴, 𓂵.
100,000: 𓆐
1,000,000: 𓁨

Each of these forms is available in Unicode as a unique character. For instance hieroglyph 2 is the character U+133FB 𓏻 EGYPTIAN HIEROGLYPH Z015A. Use these ‘normal’ forms for basic writing of numbers in Simplified Egyptian and avoid practices such as repeating 𓏺 for 𓏻 unless there is a compelling reason.

Note that large numbers such as 𓁨𓁨𓆐𓆐𓂮𓆽𓍣𓎇𓏻 2,222,222 were not generally encountered in ancient texts, so replicating the 𓁨 and 𓆐 is rather anachronistic. An alternative multiplicative notation evolved for large numbers, although uses are apparently rare, so I’ll defer this topic for now.
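
As an aside for programmers, here is a minimal sketch in C# (my own illustration, not part of any standard or existing tool; the class and method names are hypothetical) of how a number can be assembled from the normalized forms listed above, higher magnitudes written first:

using System;
using System.Text;

static class EgyptianNumerals
{
    // Index d holds the sign for d times the magnitude; index 0 is empty.
    static readonly string[] Units        = { "", "𓏺", "𓏻", "𓏼", "𓏽", "𓏾", "𓏿", "𓐀", "𓐁", "𓐂" };
    static readonly string[] Tens         = { "", "𓎆", "𓎇", "𓎈", "𓎉", "𓎊", "𓎋", "𓎌", "𓎍", "𓎎" };
    static readonly string[] Hundreds     = { "", "𓍢", "𓍣", "𓍤", "𓍥", "𓍦", "𓍧", "𓍨", "𓍩", "𓍪" };
    static readonly string[] Thousands    = { "", "𓆼", "𓆽", "𓆾", "𓆿", "𓇀", "𓇁", "𓇂", "𓇃", "𓇄" };
    static readonly string[] TenThousands = { "", "𓂭", "𓂮", "𓂯", "𓂰", "𓂱", "𓂲", "𓂳", "𓂴", "𓂵" };

    public static string Write(int n)
    {
        if (n < 1 || n > 9999999) throw new ArgumentOutOfRangeException("n");
        var sb = new StringBuilder();
        for (int i = 0; i < n / 1000000; i++) sb.Append("𓁨");        // 1,000,000 sign, repeated (anachronistic for large numbers)
        for (int i = 0; i < (n / 100000) % 10; i++) sb.Append("𓆐");  // 100,000 sign, repeated
        sb.Append(TenThousands[(n / 10000) % 10]);
        sb.Append(Thousands[(n / 1000) % 10]);
        sb.Append(Hundreds[(n / 100) % 10]);
        sb.Append(Tens[(n / 10) % 10]);
        sb.Append(Units[n % 10]);
        return sb.ToString();
    }
}

EgyptianNumerals.Write(42) gives 𓎉𓏻, and Write(2222222) gives 𓁨𓁨𓆐𓆐𓂮𓆽𓍣𓎇𓏻, matching the examples above.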

Alternative forms of numerals

The use of normalized forms as given above makes it easy to find a number such as 𓎉𓏻 (42) in web documents, word processor and spreadsheet documents, and so forth (so long as software is sufficiently up to date of course). Unicode provides some alternative forms such as U+13403 𓐃 EGYPTIAN HIEROGLYPH Z015I (numeral 5) but these alternates should be avoided for numerals in Simplified Egyptian where at all possible (𓐃 actually has a specific use as a fraction).

Other arrangements are found in Egyptian texts, such as the following form of 35 (from Gardiner, Egyptian Grammar, p. 194).

Simplified Egyptian takes the position that these kinds of numeral groups are a matter for more elaborate treatments of hieroglyphs, for use where it is not acceptable to take license and write the number as 𓎈𓏾.

Repeating numeral 1 twice may look very much like numeral 2 in a hieroglyphic font, but this practice should be avoided in Simplified Egyptian unless there is a good reason. The rationale is that most Ancient Egyptian mathematics survives in hieratic rather than hieroglyphic writing, where the numerals were often simplified into a less tally-like glyph appearance. The fact that modern discussion of the hieratic often uses a hieroglyphic presentation should not detract from the original character-like behaviour. There is also the important practical point that web searches and text processing work far better with normalized forms.

Rotated versions of units (e.g. 𓐄, 𓐅 …) and tens (𓎭 and 𓎮) are used in hieratic (and sometimes hieroglyphic) to number days of the month. Simplified Egyptian also adopts this convention (I hope to return to this on a topic about calendars).

Confusables

The stroke hieroglyphs U+133E4 𓏤 EGYPTIAN HIEROGLYPH Z001 (representing unity and used as an ideogram marker) and U+133FA 𓏺 EGYPTIAN HIEROGLYPH Z015 (numeral 1) are distinguished in Unicode. Fonts usually make the numeral stroke taller than the ideogram stroke, reflecting Ancient Egyptian conventions. Texts encoded in MdC often do not make this distinction but it is strongly recommended to do so in Simplified Egyptian so as to enable accurate text processing.

Likewise, the plurality signs U+133E5 𓏥 EGYPTIAN HIEROGLYPH Z002 and U+133E6 𓏦 EGYPTIAN HIEROGLYPH Z002A should be distinguished from numeral 3, U+133FC 𓏼 EGYPTIAN HIEROGLYPH Z015B.

In some fonts, characters such as U+0131 ı LATIN SMALL LETTER DOTLESS I and U+006C l LATIN SMALL LETTER L may look very similar to the Egyptian stroke. There are various other opportunities for confusion, for instance numeral 10 𓎆 can look very similar to U+2229 ∩ INTERSECTION and some other characters.

Other examples are the special forms for 1, 2, and 3 used in dates, which are potentially confusable with MINUS SIGN, HYPHEN and other dashes (1), EQUALS SIGN (2), and IDENTICAL TO (3), but these should never appear in a context where the meaning is unclear. The special form of 10 looks rather like SUBSET OF.

Simplified Egyptian hieroglyphs should never be written with any non-Egyptian characters just because they look similar.

Mathematics beyond numerals

Cardinal numbers, fractions, weights, lengths, and other measurements are matters for future topics about Simplified Egyptian.

Update. Apparently, according to Google, this note is the first writing of 𓎉𓏻 on the web – a reminder that it will be interesting to see how the use of hieroglyphs grows in the months and years to come.

Thursday 4 November 2010

Silverlight in the News

Silverlight made it onto the BBC News on Tuesday – Coders decry Silverlight change. Take an unfortunate choice of words by a senior executive or two; add the reactive and ill-informed commentators on some web message boards; then mix in some natural concerns from developers. Bang! Tempests in teacups, the media love them.

Personally speaking, I find it reassuring to observe that the amateur tradition is alive and well in Microsoft and at least one major multinational company is not self-wrapped in a cloak of PR and spin-doctoring. That being said, the last few minutes of Doctor Who The Christmas Invasion ought to be made compulsory viewing for all senior executives.

As a developer I'm happy so long as .Net is treated as a strategic family of products. Thanks to Novell it may become so on Unix/Linux too (even if the Linux ‘community’ is slow to recognize what the third wave of Unix is really about). Hey, there's another tabloid headline: C/C++ is dead!

I hope I'm not alone in being pleased to learn Silverlight 5 is not being rushed out. Especially if it means some of the niggles are resolved and the SL/WP7/WPF portability model improved. And Unicode 6.0 of course! A Mix 2011 Beta with Summer release please.

Two real news stories for developers:

An interesting talk at PDC 2010 for C# developers: ‘The Future of C# and Visual Basic’ by Anders Hejlsberg – don’t be put off, like I almost was, by the Visual Basic tag; it is hardly mentioned, so we are not subjected yet again to the irony implicit in the keyword Dim. The main theme is simplification of asynchronous programming with the new await keyword for the next .Net revision. Along with parallel constructs, this pattern brings very useful ways of exploiting multi-core processors to .Net in a clean software design. The talk is summarised here.
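
For readers who have not seen the pattern, here is a minimal sketch (my own illustration, roughly as the feature later shipped; FetchLengthAsync is a hypothetical method name and HttpClient is itself a later .Net addition) of why await lets asynchronous code read like sequential code:

using System;
using System.Net.Http;
using System.Threading.Tasks;

static class Example
{
    public static async Task<int> FetchLengthAsync(string url)
    {
        using (var client = new HttpClient())
        {
            // Control returns to the caller here; the continuation resumes
            // when the download completes, without blocking a thread.
            string body = await client.GetStringAsync(url);
            return body.Length;
        }
    }
}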

Developers in the .Net/WPF/Silverlight space should also check out PDC 2010: 3-Screen Coding: Sharing code between Windows Phone, Silverlight, and .NET by Shawn Burke. I alluded to the value of portable code last month in Of Characters and Strings although I didn’t highlight the .Net 4 changes that enable sharing of binary assemblies (a topic in its own right). The new tooling for Visual Studio to assist in creating Portable Assemblies, as previewed by Shawn, should be very helpful in managing the shared assembly model. It should also help focus Microsoft development on removing some of the irritating incompatibilities between Silverlight and WPF.

I just can’t wait for await.

Bob Richmond

Monday 1 November 2010

Windows: The 25th Anniversary

The first version of Microsoft Windows was released over 25 years ago.

The conventional release date quoted is that of the Microsoft Windows 1.0 retail product for ‘IBM compatible’ PCs launched on 20th November 1985. The truth, as usual, is a little different. In the beginning, as now, Windows was distributed by computer manufacturers (OEMs) and OEM releases were shipping weeks before the Microsoft retail set, possibly as early as September. In the pre-internet era product launches worked differently to nowadays.

Since blogging on My first Home PC – recollections of the RM Nimbus PC-186 I’ve browsed some of my notes surviving from 1985. There was a lot of interest in a pre-Beta of Windows I built for the BETT show in January to accompany the PC-186 launch. Windows was fairly stable by then, I'd been running prototypes during 1984 tracking alpha versions of Microsoft code. The new Intel 80186 CPU in combination with a larger memory space made Windows more fluid on the Nimbus compared with contemporary IBM PC models and their look-alikes. In May I built a Beta release for the PC-186 which had fairly widespread distribution in the UK with the early Nimbus computers sold to the RM education market of schools and colleges. After painstaking work on optimizations of the graphics driver and improvements to the Nimbus-specific DOS application switching system (all written in 16 bit assembly language), I mastered the first release candidate in late August. I don’t recall exactly when the first release version actually shipped; perhaps the answer is hidden in my attic or the RM archives.

Trivia. Windows 1 install could fit on two 3.5”/720K disks. I also created a version to run on a single 720K drive system with Windows Write, Notepad, Paint and a few other lightweight apps. However the majority of PC-186 systems ran off network servers or hard drive. Familiar features of Windows such as GDI graphics, the message pump, EXE/DLL architecture were present right at the beginning. However the original windowing system treated applications as tiles; full overlapping windows did not appear until version 2 (1987).

Windows 1 was not a commercial success for Microsoft. RM grew one of the larger installed bases among OEMs. Most PC manufacturers (including RM beyond the PC-186) embraced the quirky and limiting IBM PC hardware/BIOS compatibility design; application vendors usually worked with these primitive interfaces rather than use a hardware-independent API like Windows. A dismal state of affairs for some years. Fortunately the Apple Mac and to a lesser extent Windows and various non-Intel based machines continued to point to the future for personal computers although in late 1985, I didn’t expect it would take over four years before the best-selling Windows 3.0 (1990) established the long term shape of the Personal Computer.

A few web searches today revealed that many aspects of the evolution of personal computers appear to be well hidden. It was interesting to be involved in the emergence of the PC for a few years so I suppose I ought to return to this period occasionally to fill in some more gaps in the online record.

Postscript. The practical side of history is learning from the past. Windows itself originated at a time when the situation with early personal computers was chaotic with few standards. An array of incompatible machines faced the software developer and the early user of PC technology. Today we have a mix of new and old generation technologies with systems like iOS, Android, WP7, Kindle, Windows, OSX, Desktop Linux, Xbox, PS3, Wii etc. etc. operating in a complex connected world; each with its own different developer stories to tell. A new chaos has emerged and, I suspect, we are once again looking to redefine the meaning of personal computing.

Tuesday 26 October 2010

Egyptian Hieroglyphs on the Web (October 2010)

One year after the release of Egyptian Hieroglyphs in Unicode 5.2 there has been some progress in making hieroglyphs usable on the web although it is still early days. I hope these notes are useful.

If you can see hieroglyphs 𓄞𓀁 in this sentence, good. Otherwise, a few notes follow, and you can decide whether it might be better to wait until things have moved forward a little.

Information on Egyptian Hieroglyphs in Unicode

For information on Unicode 6.0 (the latest version) see www.unicode.org/versions/Unicode6.0.0/. While the full text for 6.0 is being updated, refer to the 5.2 version www.unicode.org/versions/Unicode5.2.0/ch14.pdf section 14.17. The direct link to the chart is www.unicode.org/charts/PDF/U13000.pdf where signs are shown using the InScribe font.

The Wikipedia article en.wikipedia.org/wiki/Egyptian_hieroglyphs is fairly accurate as far as it goes, and contains hieroglyphs in Unicode which can be viewed given a suitable browser and font.

InScribeX Web still contains the largest set of material viewable online, including sign list, dictionaries and tools. You need a Silverlight (or Moonlight) compatible system (the vast majority of PCs, whether Linux, Mac or Windows, are fine). There is no requirement to install a font. I last updated InScribeX Web in May – yes, it is about due for an update, but time is the enemy (and I’d like to see Moonlight 3 released first anyway).

Browsers

Of the popular web browsers, only recent versions of Firefox display individual Unicode hieroglyphs correctly. I expect the situation will change over the next few months. Meanwhile, use Firefox if you want to explore the current state of the art.

Search

Right now, only Google search indexes Unicode hieroglyphs (and the transliteration characters introduced at Unicode 5.1 in 2008). I expect at some point next year Bing and Yahoo will be brought up to date but meanwhile stick with Google.

Fonts

A satisfactory treatment of hieroglyphs on the web really needs smart fonts installed on your computer. I’m on the case (see Simplified Egyptian: A brief Introduction) but it will take some time until all the pieces of the puzzle including browser support come together (see ISO/Unicode scripts missing in OpenType and other comments here).

Neither Apple nor Microsoft provide a suitable font at the moment as parts of, or add-ons to, iOS, OSX, or Windows.

Meanwhile, in general I can’t advise about basic free fonts to use (fonts sometimes appear on the internet without permission of copyright holders and I don't want to encourage unfair use of creative work).

I will note an ‘Aegyptus’ font is downloadable at http://users.teilar.gr/~g1951d/ – the glyphs are apparently copied from Hieroglyphica 2000. I’ve not analyzed this yet.

For InScribe 2004 users, I currently have an intermediate version of the InScribe font available on request (email me on Saqqara at [Saqqara.org] with ‘InScribe Unicode Font’ in the message title – I get a lot of spam. That way I can let you know about updates).

Asking a user to install a font to read a web page is in general a non-starter; I think the medium term future for web sites is the Web Open Font Format (WOFF) once the dust settles on the new web browser versions in development. I’ll post here about the InScribe font in this context and make examples available when the time is ripe.

Friday 22 October 2010

Of Characters and Strings (in .Net, C#, Silverlight …): Part 2

"Your manuscript I’ve read my friend
And like the half you’ve pilfered best;
Be sure the piece you yet may mend –
Take courage man and steal the rest."

Anon (c. 1805)

As mentioned in Of Characters and Strings (in .Net, C#, Silverlight …): Part 1, many of the text processing features of .Net are such that code can be re-used over Silverlight, WPF, XNA, Mono, Moonlight etc. In this note I want to draw attention to a few key points and differences over platforms. It is far from a comprehensive analysis and I’ve skipped some important topics including collation, the variety of ways of displaying text, and input methods.

A quick aside: I failed to highlight in Part 1 how 16-bit confusion in software is not limited to .Net, similar issues arise in Java (although Java has some helpful UTF-32 methods that .Net lacks) and the situation is far worse in the world of C/C++.

I’ll restrict attention to Silverlight 4, Silverlight 3 for Windows Phone 7 OS (WPOS7.0), and .Net 4. In case the significance of the opening quote passes the reader by, it is to be hoped that Silverlight 5 and any new branches on the .Net tree feature increased compatibility, deprecating certain .Net 4 elements where appropriate.

.Net Text in general

Many Silverlight (and WPF) programmers rely mainly on controls and services written by others to deal with text, so they are unfamiliar with the lower-level implications – in this case the main thing is to make sure you test input, processing, and display components with text that contains non-BMP characters, combining character sequences, and a variety of scripts and fonts. You may also like to check out different culture settings. Relying on typing qwerty123 on a UK/US keyboard and using the default font is a sure-fire recipe for bugs in all but the simplest scenarios.
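
As a minimal sketch (my own illustration) of the kind of test data worth pushing through input, processing and display components – some plain BMP text, a non-BMP character, and a combining character sequence:

using System;

string bmpText   = "Hello, world";
string nonBmp    = "\U00013000";   // EGYPTIAN HIEROGLYPH A001 -- two UTF-16 code units
string combining = "e\u0301";      // 'e' followed by U+0301 COMBINING ACUTE ACCENT
Console.WriteLine(bmpText.Length);   // 12
Console.WriteLine(nonBmp.Length);    // 2: Length counts code units, not characters
Console.WriteLine(combining.Length); // 2: two code points that display as a single 'é'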

.Net text is oriented around UTF-16 so you will be using Unicode all or almost all the time. The dev. tools, e.g. Visual Studio 2010 and Expression Blend 4, work pretty well with Unicode and cope with the basics of non-BMP characters. I find the main nuisance with the VS 2010 code editor is that only one font is used at a given time, I can’t currently combine the attractive Consolas font with fall-back to other fonts when a character is undefined in Consolas.

For portability over .Net implementations, the best bet at present is to write the first cut in Silverlight 3, since for this topic SL3 is not far off being a subset of .Net 4. A good design will separate the text functionality support from the platform-specific elements, e.g. it’s not usually a good idea to mix text processing implementation in with the VM in an MVVM design or in the guts of a custom control. I often factor out my text processing into an assembly containing only portable code.

Text display on the various platforms is a large topic in its own right with a number of issues for portability once you go beyond using standard controls in SL, WPF etc. I’ll not go into this here. Likewise with text input methods.

Versions of Unicode

It would be really useful if the various flavours and versions of .Net were assertive about text functionality, e.g. ‘Supports Unicode 6.0 (the latest version)’ or whatever. By support, I mean the relevant Unicode data tables are used (e.g. scripts, character info), not that all or even most functionality is available for all related languages, scripts and cultures. Currently, the reality is much fuzzier and there are issues derived from factors such as ISO/Unicode scripts missing in OpenType. Methods like Char.IsLetter can be useful in constructing a 'version detector' to figure out what is actually available at runtime.
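
For example, here is a minimal sketch (my own illustration) of such a probe; it assumes that if the platform's character tables classify U+13000 EGYPTIAN HIEROGLYPH A001 (added in Unicode 5.2, unassigned before that) as a letter, then the 5.2 data is present:

using System;

static class UnicodeVersionProbe
{
    // True if the platform's Unicode tables include the 5.2 additions;
    // on older tables U+13000 is unassigned and IsLetter returns false.
    public static bool HasUnicode52Data()
    {
        return char.IsLetter("\U00013000", 0);
    }
}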

Char structure

The .Net Char structure is a UTF-16 code unit, not a Unicode character. Current MSDN documentation still needs further updates by Microsoft to avoid this confusion (as noted in Part 1). Meanwhile the distinction should be kept in mind by developers using Char. For instance, although the method Char.IsLetter(Char) returns true if the code unit in question is a letter-type character in the BMP, IsLetter(Char) is not in general a function you would use when looking for letters in a UTF-16 array.

It is therefore often a good idea to use strings or ‘string literal’ to represent characters, in preference to Char and ‘character literal’. Inexperienced programmers or those coming from a C/C++ background may find this odd to begin with, being familiar with patterns like

for (int i = 0; i < chars.Length; i++) { if (chars[i] == 'A') { /* etc. */ } }

Fortunately, Char provides methods to work correctly with characters, for instance Char.IsLetter(String, Int32). Char.IsLetter("A", 0) returns true, and the function works equally well for Char.IsLetter("\U00010900", 0) so long as your platform supports Unicode 5.0 (the version of Unicode that introduced U+10900 𐤀 PHOENICIAN LETTER ALF). Note the eight-digit \U escape; the four-digit \u escape cannot express a character beyond U+FFFF.

Char is largely portable apart from a few quirks. I find it especially puzzling that Char.ConvertFromUtf32 and Char.ConvertToUtf32 are missing from SL 3/4. Faced with this I wrote my own conversions and use these even on .Net 4.0 where the methods are available in Char, this way keeping my code portable and efficient.
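
For what it's worth, here is a minimal sketch (my own illustration, not the author's actual code) of a portable stand-in for Char.ConvertFromUtf32, usable on SL 3/4 where the method is missing:

using System;

static class PortableChar
{
    public static string FromUtf32(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            throw new ArgumentOutOfRangeException("codePoint");

        if (codePoint < 0x10000)
            return new string((char)codePoint, 1);    // BMP: a single code unit

        codePoint -= 0x10000;
        char high = (char)(0xD800 + (codePoint >> 10));    // high surrogate
        char low  = (char)(0xDC00 + (codePoint & 0x3FF));  // low surrogate
        return new string(new[] { high, low });
    }
}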

I also wrote a CharU32 structure for use in UTF-32 processing (along with a StringU32) rather than extend Char to contain methods like IsLetter(int) for UTF-32 (as is done in Java). Makes for cleaner algorithms and something Microsoft may want to consider for .Net futures.

The CharUnicodeInfo class in SL 3/4 is a subset, missing GetDecimalDigit, a rather specialist helper for working with decimal numbers in non-Latin scripts; in most cases one would use GetNumericValue, which is portable.

CharUnicodeInfo.GetUnicodeCategory and Char.GetUnicodeCategory have subtle differences.

String class

The .Net String is a sequential collection of Char objects whose value is immutable (i.e. its value is never modified but can be replaced). Informally, String encapsulates a read-only UTF-16 array. As with Char, the MSDN documentation tends to confuse Unicode character with UTF-16 code unit.

Comments above on the use of string and string literal for Char also apply here. For instance, the method IndexOf(Char) can only be used to find a Char, i.e. a code unit. IndexOf(String) must be used to find an arbitrary Unicode character. If you try entering the character literal '\U00010900' in C# in Visual Studio 2010, you will be warned “Too many characters in character literal”, a reminder that .NET character literals are not quite characters and the string literal "\U00010900" is needed.
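
To make the distinction concrete, a minimal sketch (my own illustration):

string text = "abc\U00010900def";        // contains U+10900 PHOENICIAN LETTER ALF
// int bad = text.IndexOf('\U00010900'); // will not compile: too many characters in character literal
int index = text.IndexOf("\U00010900");  // 3 -- the result is an index in UTF-16 code units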

Much of String is portable. I’ll just mention a few differences:

.Net 4.0 has extra String methods, such as StartsWith(String, Boolean, CultureInfo), which give more flexibility in working with multiple cultures. SL 3/4 is oriented around the current culture and the invariant culture, so it is not well suited to multilingual applications.

The whole model of culture and locale in software is way too complex to go into here; I’ll just say that old fallacies, such as assuming that physical presence in Iceland means a user speaks Icelandic and never spends US dollars, or that individuals use only one language, have no place in the 21st Century.

SL 3/4 is missing the Normalize method, a very handy function when working with Unicode.

SL 3/4 has fewer constructors available to the programmer than .Net 4, but some of those missing are error prone and wouldn’t normally be used in .Net 4 anyway.

StringBuilder class

Useful; gives measurable performance improvements when constructing a lot of strings.

Very portable, with just a few differences; I’ve not encountered these in real-life code yet. Rather odd that Append(Decimal) is missing from SL 3/4 though.

StringInfo class

StringInfo is designed to work with actual Unicode characters so is very useful. I get the impression it is less well known among programmers than it ought to be.

Largely portable. However, SL 3/4 is missing the SubstringByTextElements methods and the LengthInTextElements property, both of which are handy for processing Unicode correctly in readable code. I’ve found myself simulating something similar to gain portability although, that being said, the portable enumeration methods make for a more common pattern.
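
The enumeration pattern looks like this – a minimal sketch (my own illustration) that works on both .Net 4 and SL 3/4:

using System;
using System.Globalization;

string s = "e\u0301\U00013000";   // 'e' + combining acute accent, then a hieroglyph
TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
while (e.MoveNext())
{
    string element = e.GetTextElement();
    Console.WriteLine(element.Length);   // 2 then 2 code units, but each is a single text element
}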

Encoding classes

Encoding classes are often used while reading or writing text streams or converting from one text format to another, usually with a class derived from Encoding: SL 3/4 only provides the UnicodeEncoding (UTF-16) and UTF8Encoding classes, in contrast to .Net 4 which also has ASCIIEncoding, UTF7Encoding, and UTF32Encoding.

Omitting ASCII and UTF-7 is understandable, but the omission of UTF32Encoding is hard to see as anything other than an oversight, given that UTF-32 is a useful weapon in the armoury of Unicode programming (although hardly ever a good idea for file streams). One of the first things I did in the early days of the Silverlight 2 beta was to write conversion code. I hope Microsoft will add this to Silverlight 5, although for portability reasons it may be years before we can settle down with it.

The base Encoding class in .Net 4 has quite a number of extra methods and properties for mail and browser purposes as well as codepage handling. I tend to agree with the decision in SL to drop some of this complexity and discourage use of legacy stream formats, even if that means custom code is occasionally required to convert ‘Windows ANSI’, ‘Mac OS Roman’ and the ISO/IEC 8859 series of 8-bit encodings.

The main features of UTF8Encoding such as BOM handling and exceptions are portable. Unicode normalization features are again missing from SL 3/4, as with the String class.
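
A minimal sketch (my own illustration) of the portable core: a strict UTF-8 encoding with no BOM on output and exceptions on invalid input, behaving the same on .Net 4 and SL 3/4:

using System;
using System.Text;

var utf8 = new UTF8Encoding(false, true);   // no BOM on output, throw on invalid bytes
byte[] bytes = utf8.GetBytes("𓎉𓏻");        // 8 bytes: each hieroglyph takes 4 bytes in UTF-8
string text  = utf8.GetString(bytes, 0, bytes.Length);
Console.WriteLine(text.Length);             // 4 UTF-16 code units for the 2 characters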

Not the most portable part of .Net.

Tuesday 19 October 2010

Of Characters and Strings (in .Net, C#, Silverlight …): Part 1

“The time has come,” the Walrus said,
“To talk of many things:
Of shoes—and ships—and sealing-wax—
Of characters—and strings—
And why the sea# is boiling hot—
And whether pigs have wings.”

(With apologies to Lewis Carroll, the Walrus, and the Carpenter).

During discussion of my comments ISO/Unicode scripts missing in OpenType on the Unicode mailing list, the point came up about desirability of greater understanding of Unicode among programmers and others involved with software development. For a start, there is one popular myth to dispel, the subject of this post which I hope to be the first of several notes on Unicode in .Net.

Myth debunk: a Unicode character is neither a cabbage nor a 16 bit code.

The origin of 16-bit confusion lies in the history of Unicode. Twenty years ago there were two initiatives underway to replace the already outdated and problematic variety of 7/8-bit character encodings used to represent characters in modern scripts. A true Babel of ‘standard’ encodings back then made it impractical to write software to work with the world’s writing systems without a tremendous level of complexity. Unicode was originally conceived as a 16-bit coding to replace this mess. Meanwhile, the International Organization for Standardization (ISO) was working on ISO 10646, the ‘Universal Character Set’ (UCS), with space for many more characters than a 16-bit encoding has room for. The original ISO proposals for encoding were widely regarded as over-complex, so the ISO/Unicode approaches were merged by the time Unicode 2.0 was released in 1996. ISO 10646 now defines the Universal Character Set for Unicode. With unification, the notion of 16-bit characters became obsolete, although a 16-bit encoding method remains (UTF-16) along with the popular 8-bit coding (UTF-8) and a 32-bit coding (UTF-32). Each encoding has its virtues. UTF stands for Unicode Transformation Format.

To understand what constitutes the Unicode notion of ‘character’, refer to http://www.unicode.org/versions/Unicode6.0.0/ (or the earlier version while the text of 6.0 is being completed). I will try to summarize briefly.

1. An abstract character is a unit of information for representation, control or organization of textual data. A Unicode abstract character is an abstract character encoded by the Unicode standard. Abstract characters not directly encoded in Unicode may well be capable of being represented by a Unicode combining character sequence. Each Unicode abstract character is assigned a unique name. Some combining sequences are also given names in Unicode, asserting their function as abstract characters.
2. A Unicode encoded character can be informally thought of as an abstract character along with its assigned Unicode code point (an integer in the range 0 to 10FFFF hexadecimal, the Unicode codespace). As noted above it is also assigned a unique name.
3. A Unicode character or simply character is normally used as shorthand for the term Unicode encoded character.

Here are two useful ways of describing Unicode characters:

U+006D LATIN SMALL LETTER M
U+13000 EGYPTIAN HIEROGLYPH A001
U+1F61C FACE WITH STUCK-OUT TONGUE AND WINKING EYE

And similarly, with the actual character displayed:

U+006D – m – LATIN SMALL LETTER M
U+13000 – 𓀀 – EGYPTIAN HIEROGLYPH A001
U+1F61C – 😜 – FACE WITH STUCK-OUT TONGUE AND WINKING EYE

The first form is often preferable in scenarios where font support might not be present to display the actual character although on this blog I prefer to use the characters to encourage font diversity.

Note the conventional use of hexadecimal to state the value of the Unicode code point. This convention differs from the usual practice in HTML, where characters written as numeric entities commonly use decimal numbers rather than hexadecimal, e.g. &#77824; (13000 hexadecimal equals 77824 decimal), although HTML also accepts a hexadecimal form such as &#x13000;.

From a programming perspective, the simplest way of representing Unicode is UTF-32, where each code point fits comfortably into a 32-bit data structure, e.g. in C# a uint or int (C/C++ programmers note that C# defines int as 32 bits; the size does not vary with CPU register size). Not entirely trivial because there may still be combining sequences. However, UTF-32 is not used all that much in practice, not least because of memory cost.

Nowadays, most files containing Unicode text use UTF-8 encoding. UTF-8 uses 1 byte (octet) to encode the traditional 127 ASCII characters and up to 4 bytes to encode other characters. XML and HTML files are popular file formats that use Unicode (Mandatory in XML, optional in HTML where a surprising amount of the web, possibly 50%, still uses legacy encodings). I strongly recommend UTF-8 for text files rather than UTF-16 or legacy 8-bit encodings aka code pages etc. Having worked on several multilingual content-intensive projects, this is the golden rule, although I won’t expand further today on the whys and wherefores. [However I ought to mention the catch that is the ‘Byte order mark’, a byte sequence (0xEF, 0xBB, 0xBF) sometimes used at the start of a UTF-8 stream to assert UTF-8 not legacy text; this can confuse the novice particularly with ‘.txt’ files which can be Unicode or legacy. Windows Notepad uses BOM for Unicode text files. Visual Studio 2010 also uses BOM to prefix data in many file types including XML, XAML and C# code.]

UTF-16 is very popular with software writers working in C/C++ and .Net languages such as C#. A version of UTF-16 was the standard data format for Unicode 1.0. Unicode characters with character codes less than 0x10000 are said to belong to the Unicode BMP (Basic Multilingual Plane) and are represented by one 16-bit number in UTF-16; other characters require two 16-bit numbers, i.e. two UTF-16 codes taken from a range that does not encode characters, the so-called surrogate code points dedicated to this purpose. As of Unicode 6.0, fewer than 50% of characters belong to the BMP, but BMP characters account for a huge proportion of text in practice. This is by design; all popular modern languages have most script/writing system requirements addressed by the BMP, and there are even specialist scripts such as Coptic defined there. Processing UTF-16 is often more efficient than UTF-8 and in most cases uses half the memory of UTF-32; all in all a good practical compromise solution.
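
A minimal sketch (my own illustration) of what this looks like from .Net:

using System;

string s = "\U00013000";                            // EGYPTIAN HIEROGLYPH A001
Console.WriteLine(s.Length);                        // 2 -- one character, two UTF-16 code units
Console.WriteLine(char.IsHighSurrogate(s[0]));      // True
Console.WriteLine(char.IsLowSurrogate(s[1]));       // True
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));  // 13000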

Which brings me back to the 16-bit myth. The fact that so many popular characters belong to the BMP and only require one code unit in UTF-16 makes it easy to slip into thinking most means all. The problem doesn’t even arise with UTF-8 and UTF-32, but the fact is much software uses UTF-16; indeed UTF-16 is essentially the native text encoding for Windows and .Net.

Example sources of 16-bit confusion:

The article on character sets at http://www.microsoft.com/typography/unicode/cs.htm is brazen:



This article is dated to 1997 but was probably written much earlier. Windows NT 3.1 (1993) was notable as the first computer operating system to use Unicode as its native text encoding and Microsoft deserves credit for this, alongside Apple who also did much to help early uptake of Unicode (but would not have a new operating system until OSX was released in 2001). I’m quoting this as an example of the fact that there are many old documents on the Web, confusing even when from reputable sources. I should mention, in contrast, much of MSDN (and indeed much of the relevant information on Wikipedia) is pretty up to date and reliable although not perfect on this subject.

The definition of the .Net Char structure on MSDN, http://msdn.microsoft.com/en-us/library/system.char.aspx, is much more recent.



Er, no. Char is not a Unicode character. It is a 16 bit Unicode code unit in UTF-16. Actually, this is explained later on in the Char documentation but the headline message is confusing and encourages programmers to use Char inappropriately.

The reasons I chose the Microsoft examples rather than the myriad of other confusing statements on the web are twofold. Firstly I'm focussing on .Net, C# etc. here. Secondly, Microsoft are generally ahead of the game with Unicode compared with other development systems which makes errors stand out more.

Fact is .Net actually works very well for software development with Unicode. The basic classes such as 'String' are Unicode (String is UTF-16) and it is almost true to say it is harder to write legacy than modern.

I had hoped to get a little further on the actual technicalities of working with Unicode characters and avoiding 16-bit pitfalls but time has proved the enemy. Another day.

Just three useful (I hope) points on .Net to conclude.

1. Code that works with String and Char should avoid BMP-thinking, e.g. if you want to parse a String, either avoid tests like IsLetter(Char) or wrap their usage in logic that also handles surrogates (a sketch follows this list).

2. String, Char and the useful StringInfo class belong to the System namespaces and are pretty portable over the gamut of .Net contexts including Silverlight, WPF, XNA as well as the Novell parallel universe with Mono, MonoTouch, Moonlight etc. With a little care it can be straightforward to write text processing code that works across the board to target Windows, Mac, Linux, WP7 and whatever comes next.

3. Always test text-related code with strings that include non-BMP characters, and preferably also with data that includes combining sequences and usage instances of OpenType features such as ligatures.
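
On point 1, a minimal sketch (my own illustration) of a surrogate-aware loop; CountLetters is a hypothetical helper and the code assumes well-formed UTF-16:

using System;

static class Parsing
{
    public static int CountLetters(string s)
    {
        int letters = 0;
        for (int i = 0; i < s.Length; )
        {
            if (char.IsLetter(s, i))   // the (String, Int32) overload handles surrogate pairs
                letters++;
            i += char.IsSurrogatePair(s, i) ? 2 : 1;   // step over both halves of a pair
        }
        return letters;
    }
}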

Wednesday 13 October 2010

Unicode 6.0 released: Let the challenge begin

Unicode 6.0.0 was released yesterday, October 12th 2010. This is a major update to Unicode - version 5.0 was released in 2006, followed by partial updates 5.1 (2008) and 5.2 (2009). Details are given at http://unicode.org/versions/Unicode6.0.0/, also see the Unicode, Inc. press release Unicode 6.0: Support for Popular Symbols in Asia. The Unicode character repertoire reflects the ISO/IEC 10646:2010 standard for characters, Unicode itself adding much of the technical information needed for implementation of writing systems.

All of which gobbledygook masks the fact that Unicode is a rather wonderful thing, not only a valuable technology but also, in my personal opinion, a work of art and beauty that eclipses much that passes for establishment and celebrity art in modern times. Our world continues a path towards English as the lingua franca for Planet Earth, with a clear decline in the relevance of traditional languages. Yet at the same time the technology that may be seen by some as a threat can also be the saviour. Unicode is the keystone. It is a marvellous fact that 5000 years of diverse writing systems can become accessible to all for the first time in history, and Unicode has played a pivotal role in making this happen during its 20-year evolution.

A specification is only a starting point. The complete text of Unicode 6.0 is still being revised for publication next year. I recently drew attention to ISO/Unicode scripts missing in OpenType and have since been informed that work is now underway to catch up on the missing scripts. Nevertheless it can be expected that it will take months and years before computer software and digital content catches up.

A fun addition to Unicode is the set of ‘Popular Symbols in Asia’ mentioned above. Emoticons. Here are four examples:

U+1F601 – 😁 – GRINNING FACE WITH SMILING EYES
U+1F63B – 😻 – SMILING CAT FACE WITH HEART-SHAPED EYES
U+1F64A – 🙊 – SPEAK NO EVIL MONKEY
U+1F64C – 🙌 – PERSON RAISING BOTH HANDS IN CELEBRATION

I suspect Emoticons will be the popular motivator for timely support of Unicode 6.0 by the usual corporate suspects (Apple, Google, Microsoft etc.). Meanwhile expect your web browser to show ‘unknown character’ for GRINNING FACE WITH SMILING EYES etc. above.

Search engines. When I checked this morning, neither Bing nor Google search were indexing the Emoticons, I’ll keep my eyes open to report on who wins that particular race.

Internet browser support. Internet Explorer is currently the most popular Internet browser and version 9 is currently in Beta. The standards-based approach of IE9 and the promise of improved compatibility among new releases of all browsers through HTML5 support etc. is a very positive direction for the web. Firefox is the second most popular, then Safari and Chrome, both WebKit based. The level of Unicode 6.0 support in the major IE9 and Firefox 4 releases (expected in the first half of next year) may serve as one interesting predictor of directions in the browser wars.

There are no especially strong motivators in the traditional desktop software arena but the situation is different for newer device formats. Which of Android, Windows Phone 7, or iOS will support Emoticons first? What about eReaders? Silverlight? Flash? Traditionally, support for new versions of Unicode has been slow in coming, but it seems the rules are different now.

Should make for an interesting 2011.

Footnote. Much of the content of Unicode is there because a large number of individuals have freely given their time simply because of the worthwhile nature of the project. (I don’t understand why the big picture has not captured the imagination of the wealthy people of our world.) In particular I’d like to mention Michael Everson (who I worked with on Egyptian Hieroglyphs and Transliteration) who deserves recognition for his many years of effort and a dogged determination to take Unicode beyond the short term requirements of commercial and national interests.

Sunday 10 October 2010

The Real Truth about 42

Today is Sunday, 10/10/10, an appropriate day to reflect on the number 42. I’d better explain for the sake of the sanity of non-mathematicians that binary 101010 is in fact the number 42 in disguise.

An obvious feature of 42 is its prime factorization: 2x3x7. Obvious can be boring, so I'll add the more obscure fact that the sum of the pair-wise products 2x3 + 2x7 + 3x7 = 41, just one less than 42. I don’t know if anyone has named the class of numbers for which the sum of the pair-wise products of the prime factors, plus 1, equals the number itself. That sum emerged en route on a visit to Hove, so if anyone really needs a name how about a ‘Hove number’? Not an exceptional inspiration, but possibly a brand new observation to add to a large literature on the topic of 42.

More ‘fascinating’ facts about 42 can be found at Wikipedia - 42 (number) where I learned Charles Dodgson, writing as Lewis Carroll, was also fond of 42. Perhaps it’s in the Oxford tap water. Computer programmers may be amused by the fact that the wildcard character ‘*’ has character code 42 in ASCII and Unicode.

Truth is, the number 42 has been regarded as special for (probably) over 5000 years.

Traditionally, Ancient Egypt was divided into administrative districts, usually called ‘nomes’ nowadays (from the Greek word for ‘district’, Νομός; also Egyptian spꜣt/𓈈/etc.). Curiously when I placed ideograms for the 20 nomes of Lower Egypt and 22 nomes of Upper Egypt into the first draft for the (as then) proposed Unicode standard for Egyptian Hieroglyphs, it was only afterwards that 42 clicked ‘not that number AGAIN’. I expect the fame of 42 goes back to the dawn of writing and mathematics itself.

Thoth, a (male) Egyptian deity (Egyptian Ḏḥwty; 𓅝, 𓁟 etc.), was associated with wisdom, magic, writing, mathematics, astronomy and medicine. Maat, a (female) deity (Egyptian Mꜣꜥt; 𓁦, 𓁧 etc.) was associated with truth, equilibrium, justice and order. She represents a fundamental concept in Ancient Egyptian philosophy. In some later traditions which featured male-female pairing between deities, Thoth and Maat were linked together (although rarely in a romantic sense). Both deities are prominent in the judging of the deceased as featured in the ‘Book of the Dead’.

The Papyrus of Ani gives a list of 42 ‘negative confessions’ for the deceased – “I have not committed sin”, “I have not murdered” etc. The ‘Ten Commandments’ of the Old Testament can be thought of as a condensed version. Sometimes referred to as ‘the doctrine of Maat’. 42 associated deities, supervised by Thoth, were assigned to the judgment of the deceased during his or her passage through the underworld.

I can’t resist mentioning that the modern name “Book of the Dead” was invented by Karl Richard Lepsius (the Egyptian rw nw prt m hrw has been more literally translated as the ‘Spells of Coming Forth by Day’ or similar). It can be no more than coincidence that the publication in question, “Das Todtenbuch der Ägypter nach dem hieroglyphischen Papyrus in Turin mit einem Vorworte zum ersten Male Herausgegeben”, was published in 1842. Lepsius was a major and influential figure during the emergence of the modern discipline of Egyptology as well as being responsible for the creation of the first hieroglyphic typeface as implemented by typographer Ferdinand Theinhardt, the “Theinhardt font”.

The ’42 Books of Thoth’, aka the ’42 Books of Instructions’, were composed from around the 3rd century BC, supposedly based on earlier traditions. Only fragments remain from this Hermetic text which apparently contained books on philosophy, mathematics, magic, medicine, astronomy etc. A legendary source, highly influential in later traditions of mysticism, alchemy, occultism and magic. The 42 Books have been believed by some to contain the hidden key to the mysteries of immortality and the secrets of the Universe. A fruitful topic I guess for Dan Brown and other writers of fiction.

Trivia. Visiting the South Coast last December, I was amused to discover the return rail-fare from Oxford was £42. Got me thinking how often 42 has cropped up in my life. Coincidence can be good fun. I decided to keep an eye open for incidents involving near neighbours of 42: 40, 41, 43, and 44. A prospect so intriguing and exciting I’m surprised I woke up on the approach to a snow and ice encrusted Hove before the train rattled on its way to Worthing. I can now report the scientifically meaningless result after 10 months ‘research’. Those worthy siblings 40, 41, 43, 44 just don’t cut the mustard compared with their famous colleague. Perhaps it’s just me. Although when my son started at secondary school this September, there was a certain inevitability about his reply when asked in what number classroom his form was based. For a moment I thought he was kidding.

I can't really leave the topic without mentioning the obvious.

The writer most credited for the prominence of 42 in modern times is the late Douglas Adams. In his radio series “The Hitchhiker’s Guide to the Galaxy” (BBC Radio 4, 1978), the “Answer to the Ultimate Question of Life, the Universe, and Everything” is calculated to be 42. The meme exploded. Adams later claimed to have picked 42 pretty much at random.

We will never know whether Adams knew of the antiquity of 42 as a profound and famous number, indeed as the answer to his very own ultimate question. It’s easy to speculate that he must have held some knowledge, at least at some subconscious, forgotten level. A remarkable coincidence otherwise, unless 42 is in fact the answer.

Yet not impossible. After all there is something rather cute and appealing about 42. She still looks good for her age. Don’t you think so too?

Saturday 9 October 2010

ISO/Unicode scripts missing in OpenType

Unicode 6.0 release is imminent (see www.unicode.org), a year after the release of Unicode 5.2 (October 2009). Version 6.0 introduces three new scripts: Mandaic, Batak, and Brahmi. There are extensions to other scripts and many other improvements and clarifications.

An aside to anyone involved in HTML5 standardisation. It would be a really good idea if Unicode 6.0 compatibility were specified as part of the formal standard for HTML, and included in conformance testing.

OpenType is the de-facto standard for font technology and as such an essential part of implementing a script. The latest set of script tags (codes) for OpenType is given at www.microsoft.com/typography/otspec/scripttags.htm (document last updated in January 2008 when checked today).

The current ISO-15924 list of script codes is given at www.unicode.org/iso15924/iso15924-codes.html.

Unfortunately, some Unicode scripts are missing from the OpenType script tag list. This list is long overdue for an update.

The fact that Unicode 5.2 has not been incorporated in OpenType specifications a year after release makes for an unsatisfactory situation. I am writing to those concerned and encourage others to do likewise.

The following 15 Unicode scripts are missing from OpenType:

Avestan (134, Avst, Unicode 5.2)
Bamum (435, Bamu, Unicode 5.2)
Batak (365, Batk, Unicode 6.0)
Brahmi (300, Brah, Unicode 6.0)
Egyptian hieroglyphs (050, Egyp, Unicode 5.2)
Imperial Aramaic (124, Armi, Unicode 5.2)
Kaithi (317, Kthi, Unicode 5.2)
Lisu (Fraser) (399, Lisu, Unicode 5.2)
Mandaic, Mandaean (140, Mand, Unicode 6.0)
Old Turkic, Orkhon Runic (175, Orkh, Unicode 5.2)
Inscriptional Pahlavi (131, Phli, Unicode 5.2)
Inscriptional Parthian (230, Prti, Unicode 5.2)
Samaritan (123, Samr, Unicode 5.2)
Old South Arabian (105, Sarb, Unicode 5.2)
Tai Viet (359, Tavt, Unicode 5.2)

As a footnote. Not available in Unicode yet, but of interest to Egyptology are:

Meroitic Hieroglyphs (100, Mero, formal proposal with WG2)
Meroitic Cursive (101, Merc, formal proposal with WG2)
Egyptian Hieratic (060, Egyh, no formal proposal yet, contact me if you have any ideas)
Egyptian Demotic (070, Egyd, no formal proposal yet, contact me if you have any ideas)
There are also some desirable additions to be made to Egyptian Hieroglyphs (I'd like to see something with ISO/WG2 in 2012 if not before).

Thursday 30 September 2010

Simplified Egyptian: A brief Introduction

I coined the term ‘Simplified Egyptian’ several years ago as a technical approach to making Ancient Egyptian in hieroglyphs more useable in the modern digital world (see HieroglyphsEverywhere.pdf, Bob Richmond, 2006).

The snag in creating an implementation has long been external factors such as the status of web browsers and word processors, along with the associated de-facto or formal industry standards. The devil is in the detail and there are many idiosyncrasies in modern technology once one departs from the everyday. A notion like Simplified Egyptian would be no more than a curiosity if it were not widely accessible on personal computers and other digital devices.

One factor in the equation was the need to include Egyptian Hieroglyphs in the Unicode standard (published in Unicode 5.2, October 2009). Implementations of 5.2 are slowly becoming available; for instance Google web search now accepts hieroglyphs although Microsoft Bing and Yahoo search do not yet. Another key factor is support by Internet browsers. Firefox 3.6 looks viable now and I expect the latest versions of other popular browsers to support Egyptian to some degree within the next few months.

As various pieces of the technical puzzle appear to be coming together in the 2011 timeframe I thought it would be useful to summarise now what I see Simplified Egyptian being about. I envisage putting more flesh on the bones in future blog posts on a prototype implementation as leisure time permits (this is an unfunded project at present so time is the enemy).

Simplified Egyptian (SE) works as follows.

1. Define a subset of the Unicode 5.2 list of characters for Egyptian Hieroglyphs, avoiding variants and rarely used (in Middle Egyptian) characters.
2. Define fixed rules for combining hieroglyphs into groups so these rules can be implemented in TrueType/OpenType fonts or alternative rendering methods.
3. Use left to right writing direction.
4. Define data tables and algorithms for text manipulation and sorting.
5. Define ‘normalized forms’ for guidance on ‘correct’ ways of writing and processing Simplified Egyptian.

A more recent notion - Super-Simplified Egyptian (SSE) - takes these principles further by identifying an even more condensed subset of the Hieroglyphic script, a proper subset of SE, with a palette of fewer than 200 hieroglyphs.

There is no question that the SE method is highly anachronistic, SSE extremely so. Nevertheless, there is some utility in the approach.

I am also aware that superficially what I’m proposing suggests a flavour of modernised Egyptian at odds with the requirements of Egyptology for working with an ancient language whose script usage evolved over 3000 years. I will make no apologies for the fact that this is indeed one application, and if SE encourages wider understanding of Egyptian, albeit at a reduced technical level, that is no bad thing in my opinion. Nevertheless, the most interesting aspect from my own point of view is the question of how to use such a mechanism to enable improvements for academically sound publication and study of ancient texts in the context of 3000 years of language/script evolution. A non-trivial topic I shall not touch on further today.

My plan is to make available some small working examples of Simplified Egyptian on a series of web pages during the next few weeks. These examples use a WOFF (Web Open Font Format) font derived from my InScribe font. The reasons for doing this now are twofold.

1. A new generation of Internet Browsers pays greater attention to industry standards and should be capable of supporting Simplified Egyptian. Firefox 4 and Internet Explorer 9 are in Beta at the moment and I want to make the samples available for browser testing in case there is a need to shake out any browser bugs.

2. InScribe 3 for Windows will not use Simplified Egyptian. Originally it was my intention that SE would be a feature but it turned out to introduce too many complications in modes of use. Nevertheless, InScribe 3 retains some ‘SE-friendly’ characteristics and I want to be able to test these for real on the Web as I complete work on the software.

The samples will not work for users of older browser technology (right now that means a high-ninety-something per cent of internet-capable devices). My short term concern is only that an elegant and simple to use implementation works as and when devices gain adequate support for internet standards.

That is not to say that workarounds can’t be contrived for devices whose manufacturers or users are not able or prepared to adapt to the new standards-based internet landscape. I'm happy to hear of any proposals.

Right now, this means use Firefox 3.6 or later to view samples as intended. I’m also tracking Chrome, Internet Explorer 9 Beta and Safari releases.

Wednesday 29 September 2010

Browser of the month: Firefox

My post yesterday Quick test for Ancient Egyptian in web browsers (September 2010) actually exposed three bugs.

None of which involved Firefox. In fact Firefox 3.6 and later correctly display transliterations and hieroglyphs on a Windows system with a suitable Unicode 5.2 font containing hieroglyphs and the other characters.

The bugs are:
1. The latest releases of Chrome, Internet Explorer (8 and 9 Beta) and Safari do not pick up that there is a local font with hieroglyphs. Basically a bug with Unicode 5.2 support, I think. Attn: Apple, Google, Microsoft.
2. The same three browsers incorrectly process characters in the SMP given as character references, e.g. &#55308;&#56639;. Firefox is correct in displaying this pair as two bad characters per the HTML specifications. This UTF-16 style of surrogate representation is not valid HTML: in my example the correct character reference is &#78143;. Attn: Apple, Google, Microsoft.
3. The Blogspot post editing software gratuitously changed my UTF-8 text into character references &#55308;&#56639; for no apparent reason. The editing software is also buggy when I try to re-edit the post. Attn: Google, Blogger.

So a gold star to Mozilla/Firefox.

The wooden spoon ought to go to Google for hitting all three bugs, but in mitigation I'll observe that hieroglyphs are now supported by Google search (unlike the situation with Microsoft Bing and Yahoo, who haven't even caught up with Unicode 5.1, never mind 5.2. A tribute to corporate lethargy - wake up, guys).

After discovering the Blogger bug, I've opened a secondary blog on WordPress - Journal of Total Obscurity 2. For the time being this remains my main blog but WordPress will be used for posts with hieroglyphs.

I've retained my original post here 'Quick test for Ancient Egyptian in web browsers (September 2010)' so bugs can be monitored but I've uploaded the correct version at Test page for Ancient Egyptian Hieroglyphs in Unicode (September 2010) on WordPress.

Tuesday 28 September 2010

Quick test for Ancient Egyptian in web browsers (September 2010)

A quick test note to check Ancient Egyptian in Web browsers.

If you have a (Unicode 5.2 compatible) Egyptian font installed on your system, the next few lines ought to make sense:

ꜣꜢiIꜥꜤwWbBpPfFmMnNrRhHḥḤḫḪẖH̱sSšŠḳḲkKgGtTṯṮdDḏḎ

(in MdC this Egyptian transliteration reads +taa*AA*iIwWbBpPfFmMnNrRhh*HH*xx*XX*ss*SS*qq*kKgGtt*TT*dd*DD*)


𓄿𓇋𓏭𓂝𓅱𓃀𓊪𓆑𓅓𓈖𓂋𓉔𓎛𓐍𓄡𓋴𓈙𓈎𓎡𓎼𓏏𓍿𓂧𓆓

(in MdC these Egyptian hieroglyphs read +s-A-i-y-a-w-b-p-f-m-n-r-h-H-x-X-s-S-q-k-g-t-T-d-D)

In fact, this is a FAIL for hieroglyphs today on Windows for Chrome (6.0.472.63), Firefox (3.6.10), Internet Explorer 9 (Beta 9.0.7930), and Safari (5.0.2). Only Firefox successfully displays the transliteration.

Tantalizingly, the Firefox edit box does work:

Technically, all a browser needs to do is enumerate all fonts on the host system and, if the font implicit in the HTML is not present, use any available font that supports the characters. Perhaps there needs to be some magic setting in the TrueType fonts for the browsers to work, although this ought not to be necessary, so I will count this as a multi-browser bug.

The lines should read:


Update. This site, Blogger, turned my HTML hieroglyph strings into entities, e.g. hieroglyphs in UTF-8 into &#55308;&#56639; etc. Firefox has a bug in this case (entities in the Unicode SMP) but not when raw UTF-8 is used in HTML, so Firefox is very close to working; indeed it is good for many web pages. Blogger is a bit broken: the entities are simply confusing and bring nothing to the party.


Monday 27 September 2010

Q. Why is my screen too small?

A. It’s a tradition.

Last month I mentioned the 25 year old RM Nimbus PC-186 and its 640x250 display. 250 was the number of lines displayable on a ‘CGA’ class CRT monitor of that time (more precisely at 50/60Hz non-interlaced). The 14" CGA was the only mass produced monitor available at a reasonable price in 1985 and it was that fact as much as the cost of the driver electronics that influenced the low resolution choice of display mode. By 1988, 14" ‘VGA’ type monitors were in mass production at 640x480 resolution and these soon gained higher definition 800x600.

During 1987-1993 one part of my job with RM involved working with a series of US Silicon Valley based companies who were growing the capabilities of PC graphics systems into the affordable market. Computer graphics has always been a personal interest so it was fun to be involved in bringing out the then-emerging technology that is nowadays taken for granted. My main role was writing device drivers for Windows and working with the chip designers to boost performance. During this period, the ‘holy grail’ was to reach 1024x768 24-bit colour with an inexpensive design, a point we reached for the first time with a Cirrus Logic chip in 1993. This hit acceptable performance goals for Windows 3.1, removing the need for the transient 256 colour type displays popular for a while but problematic from an application programming point of view.

Two flies in the ointment. 1. Computer monitor manufacturers took a long time to come around to the obvious fact that 14"/15" CRT displays were too small for applications like word processing and spreadsheets. The sweet spot was 17"/19" but it seemed to take forever before it was accepted this was a volume market and the price benefits of mass production held sway (21" and above were cool but too unwieldy in CRT except for specialist applications such as CAD). 2. Most employers, schools and universities regarded it as acceptable to save a hundred dollars or so even if that meant seeing armies of highly paid employees and students hunched over small monitors peering at a fraction of a spreadsheet or page of text.

So much for history, though I’ll repeat the point that it seems to be a well-established tradition to use displays that are too small for the purpose. Eventually things get better, and nowadays good flat-screen displays for desktop computers are very affordable. Yet last year I visited the newsroom of a popular newspaper and it was almost laughable to see journalists and typesetters using displays that were obviously too small to work efficiently with a tabloid format. All to save the cost of a lunch or two!

Moore’s Law in the twenty-first century means electronics shrinking to give high functionality with reduced power consumption, and the consequent growth of small-format computing: laptops, netbooks, smartphones, tablets, eBook readers. In each case the same pattern. Early devices have less-than-usable screen sizes, and not just for reasons of manufacturing cost. Product marketing tries to avoid the fact that the emperor has no clothes. Keen leading-edge users, in denial, claim it’s all OK. Markets learn, and devices gradually move to something more ergonomic and pleasant to use.

This topic came to mind while I was tweaking an InScribe design for netbooks (typically a usable 1024x600 10" display nowadays, after that unfortunate early fad in 2008 for the 90s-retro 800x480 resolution on 7" screens), and while reading today’s announcement of the upcoming RIM ‘PlayBook’ device (a 7" LCD, too small for its aspirations in my opinion; see the, I expect ill-fated, Dell Streak). Not that 7"/8" is a bad format for many purposes (note to Amazon with the 6" Kindle, and to Sony: try measuring a paperback book!).

Incidentally, whatever the flaws in the first-generation iPad, 9.7" is not dramatically smaller than the optimum size for its purpose, so kudos to Apple for bucking the usual pattern (although I personally think an 11-12" touchscreen hits the right compromise between portability and function).

So if you find yourself peering at the internet through a 3.5" supposedly state-of-the-art smartphone, remember that for users to suffer for a while is a tradition, that you are paying the price of being part of history in the making, and that things will soon get better (better for smartphones, I suspect, means a roughly 4.2-4.5" screen with a narrow bezel in current technology).

PS. Inches not metric; another tradition.

Thursday 23 September 2010

Document embedding and OLE

There are a number of technical issues concerned with what I'm attempting to accomplish with the InScribe software. The biggest thorn in my side is the issue of embedding, visualising, and editing embedded data in compound documents. This is therefore something of a background note on the topic.

The notion of embedding or linking an ‘object’ in a document is commonplace. Web pages incorporate pictures, videos and specialized objects such as Flash or Silverlight interactive components. Word processing documents likewise contain objects, sometimes interactive objects, alongside the text. The idea of ‘compound document’ goes back to the 1980s.

The current situation with embedded objects is chaotic. Examples: Microsoft Office and OpenOffice, the two most popular office suites, have different schemes for add-ins. Even limiting attention to one vendor, Microsoft Office has added features in the 2003, 2007 and 2010 editions: all well and good, but this makes it difficult to support a diverse user base (Office 2003 is still widely used). Firefox, Chrome and Internet Explorer each have their own plugin approach, and the story gets more complicated when considering non-Windows browsers.

We are sorely missing standard, flexible, open approaches to embedding. On the web side of the coin, HTML5 is a move in the right direction, though in itself no panacea, despite what some less technically minded commentators may say on the subject (a topic for a future blog entry!).

A concrete example. There is no fundamental hardware-related reason nowadays why a simple photo-editor plug-in component could not operate on devices as diverse as eBook readers, smartphones and tablet computers, as well as notebook and desktop computers. Such a plug-in could enrich many applications, not only web browsers but word processors and camera management tools. In our Babel-like world of 2010, to accomplish such a thing in software would involve writing versions for iOS, OS X, Android, Windows XP, Windows 7, Symbian, Kindle, Blackberry, Gnome, Wii... the list goes on. The developer then needs to tackle application-specific rules for plug-ins (if such exist). A relatively simple piece of software is almost impossible to deploy widely.

I am not advocating one ring to rule them all, simply highlighting the severe lack of standard ways for applications and components to interoperate and deploy.

In fact, for Microsoft Windows applications there has been one solution for almost 20 years. OLE (Object Linking and Embedding) was introduced to Windows in 1990 and expanded substantially in version 2 in 1993. Parts of the OLE system were renamed ActiveX controls in 1996. I’ve exploited OLE with the InScribe software to enable in-place editing of Ancient Egyptian in word processing and other applications.

OLE has never been an ideal technology solution. Parts are overcomplicated and error-prone during development. The OLE design is too specific to the classic Windows architecture. There was originally insufficient attention given to security issues, although this has now largely been addressed. Nevertheless, for Windows applications, OLE provided some solutions to the big problem of making applications and components work together by defining standard rules for interoperability of functions like embedding and compound documents.

Unfortunately, OLE development by Microsoft pretty much stopped well over a decade ago, more a victim of fashion than of any logical reason, I suspect. One side effect is that something as fundamental as how copy and paste works between Windows applications is still stuck in a 1990s time-warp.
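
As a small illustration of that time-warp, here is a minimal C# (Windows Forms) sketch that simply lists the formats currently on the clipboard. Copy an embedded object (an equation, say) from an Office document first and you will typically see classic OLE 2 format names such as "Object Descriptor" and "Embed Source" in the list.

// Minimal sketch: list the data formats currently on the Windows clipboard.
// Requires a reference to System.Windows.Forms; clipboard access needs an STA thread.
using System;
using System.Windows.Forms;

class ClipboardFormats
{
    [STAThread]
    static void Main()
    {
        IDataObject data = Clipboard.GetDataObject();
        if (data == null)
        {
            Console.WriteLine("Clipboard is empty or unavailable.");
            return;
        }
        foreach (string format in data.GetFormats())
        {
            Console.WriteLine(format);
        }
    }
}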

As the internet grew and new technologies such as Java and .Net became available, inasmuch as there was any attempt to address application interoperability, solutions tended to be product-specific. Microsoft themselves targeted ‘the enterprise’ with less focus on the general personal computer user. Rather than an OLE philosophy where third parties can expand the capabilities of Windows and its applications in a general way, Microsoft Office became focussed on enterprise-oriented, Office-specific add-ons. Linux and other alternatives failed to rise to the challenge of developing a more open and flexible approach.

OLE continues to be supported in Windows and Office, but pretty much in maintenance mode only. OpenOffice and other products continue likewise. I have not looked at the latest Adobe Creative Suite (CS5) but recall an earlier move around CS3 to remove some OLE functionality. Microsoft Office Word 2010 runs OLE embedding in a compatibility mode. I don’t expect OLE to disappear in the next few years, but it is certainly becoming less usable.

If anyone reading this understands why the issue of application interoperability and OLE type functionality is missing from .Net and WPF I'd be delighted to find out.

This slow decline of OLE and the lack of practical modern alternatives proved a stumbling block in my development of a new version of the InScribe for Windows software. I expect other developers are in a similar situation of having to make some undesirable compromises in order to get a product released. On a positive note, the problem has stimulated a number of interesting ideas for future directions and I hope to touch on these here during the next few months.

Meanwhile, a call for anyone working on document embedding: please learn from the past. Let’s try to ensure that whatever is latest and greatest at least accomplishes what OLE did 20 years ago (and still almost does).

Tuesday 24 August 2010

Mangaglyphics

Apparently the Japanese word manga (katakana マンガ, kanji 漫画, hiragana まんが) can be loosely translated into English as “whimsical pictures”. Distinctive manga styles have seen growing popularity outside Japan during the last few decades, predominantly through comic book, cartoon, and video game formats.

Some time ago I had the crazy notion that there are some interesting ways to combine the tradition of ancient Egyptian hieroglyphs with manga styles in an entertaining way; mangaglyphic seemed like the word, rather than the equally obvious hieromanga. Fun, possibly with some educational value.
However, with the work that needs to be done on improving the accessibility of non-whimsical applications of ancient Egyptian on personal computers and other devices, mangaglyphic is pretty low on my software to-do list.

So why mention the term right now? Partly because it looks like a mangaglyph or two are creeping unasked into the InScribeX Web user interface. Partly because I’d be delighted to hear from artists or others experimenting with this style of image. However, what actually stimulated my writing today was discovering that the search engine bing.com still returns zero results for mangaglyphic or related words, and google.com returns only one result. So, in the unlikely event the term catches on at all, I wanted to state that mangaglyphic is meant to be a generic word. No attempts to register trademarks etc., please.

Thursday 12 August 2010

My first Home PC – recollections of the RM Nimbus PC-186

Despite having recently mentioned the Jupiter Ace and Sinclair Spectrum, early hobbyist computers, I am happy to admit to never owning either of those devices, being far too impatient an individual to work with cassette tapes and the like. So I was never an early adopter of computers outside work, and my personal home computer journey began in 1984.

The RM Nimbus PC-186 was released sometime in early 1985 - I recall demonstrating a beta-release of Windows (version 1) on the Nimbus to journalists visiting BETT 85 (British Educational Training and Technology Show). Incidentally BETT (www.bettshow.com) has grown to be the major annual event for information technology in UK Education, now occupying the huge Olympia exhibition centre in London for several days each January. But I digress.

Released just around the time that ‘IBM compatibility’ became fashionable for Personal Computer design, the Nimbus used an Intel Processor (80186) to run MS-DOS (3.x) but made no attempt to match the hardware compatibility points of an IBM PC. This was not unusual in the early 1980s; here in the UK, non-IBM-compatible MS-DOS computers from Apricot Computers were popular in some commercial sectors around the same time.

RM (then officially Research Machines Limited, www.rm.com/) was already established as one of the leading suppliers to UK education with the RM 380Z and 480Z computers, both 8-bit machines running CP/M on the Z80 microprocessor. These were among the first small computers to make substantial use of networking; the 480Z was a very early example of a diskless workstation on a Local Area Network (LAN).

Working with RM at the time, I was fortunate to get my hands on one of the first batch of Nimbus prototypes in 1984, so my first home PC was this ugly but functional box with unfinished casework, an item for the study rather than the living room.

The Nimbus was the first 16-bit computer from RM. The combination of a faster processor than the IBM PC (8MHz 80186 as against 4.77MHz 8088), a larger maximum memory space (960K v 640K), and use of 3.5" 720KB disks (v 5.25" 360KB) raised some interest. A unique feature was ‘Piconet’, a serial interface for peripherals and a kind of early forerunner of USB; a good idea, but before its time and let down by the performance of its implementation. My machine had a 10MB hard drive, a luxury at a time when systems often had to make do with floppies or a LAN server. The Nimbus was a modest commercial success. I have no sales numbers to hand, but certainly over 100K systems shipped to UK schools before RM adopted mainstream IBM-compatible 286 and then 386 based systems. The main competition in education was the Acorn/BBC microcomputer. Unlike the situation in North America, Apple never made much progress in the UK education market, largely due to high prices compared with Acorn and DOS-based systems.

The Nimbus display was unique – a ‘high’ resolution graphics mode of 640x250 pixels with only 4 colours (black, white and a choice of the other two from the usual 14 suspects; I usually settled for red and blue). In modern terms that sounds like a nightmare. Indeed the reality was worse, just not quite so drab as the monochrome 640x200 (CGA) graphics used on the IBM PC or the 512x342 Apple Macintosh of the same era. A more colourful 320x250 display mode was used by most educational software for the Nimbus, but I rarely used this 16-colour mode personally, my home computer activities largely involving software development, word processing and spreadsheet applications. My campaign for usable graphics on personal computers was already underway by then, but that is another story.

My software in 1984/5. TXED, the RM full-screen text editor, was useful. I created the Nimbus ports of Microsoft Word (for DOS) and Microsoft Multiplan (the DOS-based predecessor to Excel) for RM. This combination made for a basic set of PC office productivity applications and was bundled as such with a range of Nimbus configurations. For software development, I learned C and the new C++ programming language. However, assembly language programming was still crucial for these slow, low-memory systems. My largest assembly project was an adaptation of Microsoft Windows 1.x to the Nimbus.

It is a reflection on the limited applications of that time, I suppose, that this development work was about the most entertaining aspect of home computing in my early personal experience. Fortunately for children and teachers in schools, a useful and sometimes fun catalogue of educational software and simple games grew up for the Nimbus.

After around 18 months with the Nimbus PC-186 as my home computer, I upgraded to an 80286 IBM compatible for most activities, although my Nimbus hung around for several years (and several revisions of Windows, up to and including the 1990 release of 3.0).

An interesting footnote on those early years of software development for Microsoft Windows. Curiously enough, the extra memory available over the IBM PC architecture meant that Windows on the Nimbus had about twice the memory available for applications once the fixed overheads of DOS and Windows were taken into account. It took until 1988 and the early alpha versions of Windows 3.0 for an IBM compatible to win out in the memory stakes.

Saturday 7 August 2010

Missing in Silverlight 4: a functional GlyphTypeface class

Warning. Obscurity level: HIGH.

This note is primarily aimed at the Silverlight development team in Microsoft Redmond. Other Silverlight developers may also want to understand a limitation of Silverlight 4.

Background
Applications that require superscripts, subscripts, and other rich text functionality need control of character/glyph placement. Advanced typography is also useful in applications such as e-book readers, where it is often desirable to represent the look and feel of the book accurately. Specialist applications that do mathematical typography (and my ancient Egyptian work) need this kind of precision. From a developer perspective, it is the GlyphTypeface class in the .Net/WPF System.Windows.Media namespace that provides much of the required functionality for WPF applications.

The problem
The Silverlight 4 documentation available from Microsoft (see http://msdn.microsoft.com/en-us/library/system.windows.media.glyphtypeface(VS.95).aspx) states

“The GlyphTypeface object is a low-level text object that corresponds to a single face of a font family as represented by an OpenType font file, or serialized as a block of memory in a document. Each glyph defines metrics that specify how it aligns with other glyphs. The correct GlyphTypeface to use for a run of characters in a given logical font is normally determined by the Silverlight font system.
The GlyphTypeface object provides properties and methods for the following:
· Obtaining font face common metrics, such as the ratio of ascent and descent to em size.
· Obtaining metrics, outlines, and bitmaps for individual glyphs.”

Er no! The WPF 4 version of GlyphTypeface indeed does this. However, Silverlight 4 only supports reading the name of the font and its version number. All the useful functionality is missing. The documentation quoted applies only to WPF.

It is therefore impossible in general to implement advanced typography in Silverlight. A big hole: this kind of typography has been possible on Windows since Windows 3.1, the first release (1992) to incorporate scalable (TrueType) fonts. [Note: sure, there are clumsy workarounds in very special circumstances, but I won’t go into those today.]
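
For comparison, here is a short WPF (.Net 4) sketch of the glyph-level information the full GlyphTypeface exposes and Silverlight 4 omits; the font path is only an example.

// WPF (.Net 4) sketch: glyph-level information from GlyphTypeface.
// The font path is just an example.
using System;
using System.Windows.Media;

class GlyphMetrics
{
    static void Main()
    {
        var face = new GlyphTypeface(new Uri(@"C:\Windows\Fonts\arial.ttf"));

        // Font-wide metrics, expressed as fractions of the em size.
        Console.WriteLine("Baseline: {0}  Height: {1}", face.Baseline, face.Height);

        // Map a character to a glyph index, then read per-glyph metrics.
        ushort glyphIndex = face.CharacterToGlyphMap['W'];
        Console.WriteLine("Advance width of 'W': {0}", face.AdvanceWidths[glyphIndex]);

        // The glyph outline itself, for precise placement or drawing.
        Geometry outline = face.GetGlyphOutline(glyphIndex, 72, 72);
        Console.WriteLine("Outline bounds: {0}", outline.Bounds);
    }
}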

The solution
Expand the GlyphTypeface class in Silverlight 5 to provide all the missing functionality, except where this conflicts for some reason with the Silverlight security model. In particular, discovery of the black box (ink bounding box) for a glyph is essential, as is CharacterToGlyphMap (without which the ‘Glyphs’ class has only limited use). A fairly small amount of straightforward work in the Silverlight runtime would yield a big benefit to third-party developers and should also help functional enhancement of controls such as RichTextBox.

Note: Windows Phone 7 is also lacking functionality here.

Tuesday 3 August 2010

Ace Computers (Pilot and Jupiter)

In the interests of the online documentation of obscure connections.

Several years ago, I attended a celebration of the life of Alan Turing (1912-1954) at King's College Cambridge. Turing is recognized nowadays as an influential pioneer in the history of Computer Science and Computers. His key contributions to British code breaking work at Bletchley Park during World War II have historical significance. At the King's event, it was fascinating to meet several of the people who had worked with Turing; a reminder of how young computer technology really is.

This recollection came to mind recently while reading an article (How Alan Turing's Pilot ACE changed computing) on the BBC website. The Pilot Ace was an early computer designed by Turing and developed at the British National Physical Laboratory (NPL) from 1946 until its release in 1950. For several years the Pilot Ace was used for commercial applications; it can now be seen in the London Science Museum. The BBC article refers to a radio interview with Tom Vickers, operations manager on the Pilot Ace project (as of writing, the interview is still available via Harriet Vickers' blog, starting 11 minutes into the podcast).

An early home computer called the Jupiter Ace was released in 1982, an untypical device for its time based around the FORTH programming system, an approach that yielded a little more efficiency than similar machines of that era, which were mostly programmed using interpreted BASIC.

Computing technology had advanced considerably since the era of the Pilot Ace 30 years earlier, but the two devices in their own ways illustrate the challenge of trying to work with hardware not quite ready for prime time yet interesting nevertheless. A basic Jupiter Ace came with 1024 bytes (1K) of main memory, expandable to 49x1024 bytes (49K). The Pilot Ace originally had 512 bytes of main memory, later expanded to 1408 bytes (implemented using mercury delay lines!), with a 16K byte drum memory peripheral.

Now for the obscure part. The Jupiter Ace ROM software was written by Steve Vickers. Steve had already written much of the ROM software for the Sinclair ZX home computers, popular hobbyist type machines in the UK of the early 1980s. The name "Jupiter Ace" was inspired by the work of his father Tom Vickers on the Pilot Ace. Generations.

It is now 60 years since the Pilot Ace release and almost 30 since the Jupiter Ace, and the computing landscape has changed far more dramatically in the last three decades than in the first three following Turing's work. A myriad of observations could be made, but I'll simply note that commonplace telephones nowadays have a billion times more memory than the Pilot, and hundreds of thousands of times more than the Jupiter.

Perhaps it is time for another Ace computer.

Thursday 17 June 2010

C++ vs. C# 2010 ... and the winner is ...

As far as I know there is no World Cup for programming languages, quadrennial or otherwise. No FIFA++ or FIFA# either. But, if there was...

I have been helping for several months with a project aiming to commercialise an optimising technique for microelectronic circuits discovered as part of some academic research at a local university. Who would predict that the mathematics of Galois polynomials would have an industrial application?! Well, this is my journal of total obscurity.

Writing the software. Apart from the user interface, visualisation and usability (all good stuff), a key element of the project is of course the algorithms, some of which require a lot of computational processing by current standards and can be time-consuming (potentially days rather than hours or minutes).

The original research software was implemented in C++ with the STL, and much of the time I continued using C++ for the algorithm work as an adaptation of the original, despite the fact that I’m far more C# oriented nowadays. Mental exercise is a good thing. My Windows software design was a hybrid: the user interface and control functionality in C#/.Net/WPF, with the algorithms packaged in a DLL implemented in C++. Standard .Net interop linked the two components together, as in the sketch below.
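
For anyone curious what that interop looks like, the glue can be as thin as a single P/Invoke declaration. The sketch below is illustrative only: "Algorithms.dll" and OptimiseCircuit are made-up names standing in for the real C++ exports.

// Illustrative sketch of .Net interop with a native C++ DLL via P/Invoke.
// "Algorithms.dll" and OptimiseCircuit are hypothetical names; the real
// declaration must match whatever the C++ side actually exports.
using System;
using System.Runtime.InteropServices;

static class NativeAlgorithms
{
    [DllImport("Algorithms.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern int OptimiseCircuit(int[] netlist, int length);

    public static int Optimise(int[] netlist)
    {
        // The C# front end calls this; the heavy lifting happens in native code.
        return OptimiseCircuit(netlist, netlist.Length);
    }
}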

I suspect most developers with strong experience in C++ and C# will agree that a modern language such as C# can be far more readable than C++. However, a common concern is the efficiency of C# on the .Net and Mono environments using the Common Language Runtime (CLR) compared with C++ statically compiled into native machine code. There is no question that C and C++ work closer to the metal and that their compilers generate very efficient code in the right hands (C has been around since 1973 and C++ since 1983). The C/C++ combination remains the most efficient programming choice for some applications.

Question. What happened when I rewrote a key algorithm from C++ into C# using the same logic?

Answer. Performance improved by an order of magnitude. In fact, my initial benchmarks were so close to exactly 10x faster that I felt obliged to check and recheck. Memory usage for the C# version was about 5% less than for C++ on this specific problem.

[In both instances, C++ and C#, I used Microsoft Visual Studio 2010 and the .Net 4 runtime, but I have since replicated similar figures on Linux using the GNU C++ compiler for C++ and MonoDevelop 2.2 with Mono 2.6.4 for C#.]

Understanding the why. Partly it is down to inefficiencies in the C++ heap and the STL templates (the software uses hashset/unordered_map/vector objects). Partly it is the benefit of automatic lifecycle management of C# objects with a built-in garbage collector. There is also a little benefit from C# language features such as the foreach construction.
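
To give a flavour (a toy sketch, not the project's actual algorithm): the C# version leans on HashSet<T>, Dictionary<K,V> and List<T> where the C++ used the hash-set, unordered_map and vector constructs, with foreach iteration and object lifetimes left to the garbage collector.

// Toy sketch of the C# constructs that replaced the STL containers.
using System;
using System.Collections.Generic;

class ContainerSketch
{
    static void Main()
    {
        var terms = new List<int> { 3, 5, 3, 9, 5 };   // roughly vector<int>
        var seen = new HashSet<int>();                 // roughly a hash set
        var counts = new Dictionary<int, int>();       // roughly unordered_map<int,int>

        foreach (int term in terms)
        {
            seen.Add(term);
            int count;
            counts.TryGetValue(term, out count);
            counts[term] = count + 1;
        }

        Console.WriteLine("{0} distinct terms", seen.Count);
        foreach (KeyValuePair<int, int> pair in counts)
        {
            Console.WriteLine("term {0} occurs {1} time(s)", pair.Key, pair.Value);
        }
    }
}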

OK, I’m sure that by adjusting the algorithms and replacing STL constructs I could bring the C++ up to a similar or higher speed than the C# version. Plenty can be said on that topic. But life is too short for such antics, especially when the C# version is more concise and readable than a C++ version could ever be.

Back in 2006 Italy lifted the trophy, C# and CLR implementations were not as mature, C++ held sway. Times change.

Now, in Summer 2010, to my mind C# takes the top of the podium. Weighing the evidence: fact, not opinion. Performance and elegance are so often a trade-off, but this recent experiment of mine sealed the question for me.

Postscript. I wrote this note to encourage other developers to look again at the rationale for choosing C++ over a modern programming language, even for computationally intensive tasks. The usual caveats apply, especially 1. existing systems where a language change is impractical and interop inappropriate, and 2. the limited class of problems where C++ still rules. Personally speaking, I'll continue to choose C or C++ where there is a clear benefit, but the problem domain where this is true seems to have shrunk dramatically.

Thursday 20 May 2010

Lost in Hieroglyphs

One evening in February 2006, I noticed there had been a huge number of downloads of a draft specification I'd written about the encoding of Egyptian Hieroglyphs in the Unicode Private Zone (EGPZ=EGyptian in the Private Zone). Over 10,000 downloads within a week. A little detective work yielded the discovery that some hieroglyphs had appeared in the TV series 'Lost' and in the thirst for knowledge ... well, goodness knows what the legion of Losties made of such a dry document.

A proposal for Basic Egyptian Hieroglyphs in Unicode by Michael Everson and myself was starting to take shape at that time. I had also begun drafting a presentation entitled Hieroglyphs Everywhere for the Informatique et Egyptologie (I&E 2006) conference being held in Oxford that Summer. So the timing of the 'Lost event' was coincidental, indeed an encouragement to explore further the notion of making Ancient Egyptian more accessible in popular culture beyond the academic dimension.

Four years on. The EGPZ specification was released in 2006, and the Unicode 5.2 Standard (October 2009) now contains Basic Egyptian Hieroglyphs. The process of making Ancient Egyptian more accessible continues. I&E 2010 is being held in Liege this July, and I've just started writing a minor revision of the EGPZ specification and a follow-up to the Hieroglyphs Everywhere talk.

And 'Lost' is coming to an end on Tuesday 23rd May after six seasons. Perhaps the full significance of the statue of Taweret will be revealed. Most likely not, and that's a good thing in my opinion; the world is a better place for some notions to remain wrapped in mystery.

Then there is the fact that 42 is one of the 'Lost Numbers'.  The mathematics and science of coincidence. Another day!

Monday 10 May 2010

InScribeX Web Preview 3 released

I have just released Preview 3 of the InScribeX Web software on http://www.inscribex.com/. This version replaces Preview 2 for Windows and Mac users and works with Silverlight version 3 or 4. Linux users will probably want to stick with Preview 2, which runs with Moonlight 2, for the time being (see note below).


As illustrated, the user interface has been changed to require less screen space. This is very useful on low-resolution displays, especially those found on netbooks. I have also chosen this two-page view for the dictionaries so that English-Egyptian and Egyptian-English can be viewed simultaneously (although it is probable that additional ways of working with the dictionaries will follow at some point).

Some features I had hoped to include in Preview 3 have been deferred so that the software works with the current pre-release of Moonlight 3 (Moonlight is the equivalent of Silverlight for Linux systems). I hope to update Preview 3 over the summer to track Moonlight development and make a few additions and changes to functionality, the most interesting being some basic UMdC editing features and some revised dictionary content.

Preview 3 is about 25% smaller than Preview 2, so it loads faster over the web.

Coming soon ... InScribe Web Preview 4
Preview 4 is being developed in parallel with Preview 3, and I've adopted a development approach that allows components to be shared between the two versions. This sounds rather complicated but makes sense from my development perspective as part of the strategy of making InScribeX cross-platform over a range of computers and other devices. For the majority of Windows and Mac users, all this means is that you should use Preview 3 for the time being, then switch to Preview 4 when it is available (best guess: sometime this summer).

Preview 4 takes advantage of new features in Silverlight 4 to enable printing and rich text editing of Egyptian texts among other enhancements. Watch this space.

InScribe Web on Linux
Moonlight 2 was released in December 2009 as a Firefox plugin for Linux (it can be downloaded for popular modern Linux distributions from www.go-mono.com/moonlight/download.aspx). Moonlight 2 enables InScribe Web Preview 2 to operate on Linux systems.

Pre-release 'alpha quality' Moonlight 3 plugins for the Firefox and Chrome browsers on Linux can be downloaded from go-mono.com/moonlight/prerelease.aspx. InScribe Web Preview 2 appears to work just as it does with Moonlight 2. InScribe Web Preview 3 mostly appears to run okay on the most recent (April) plugin versions. However, one unavoidable problem at the moment is that the full dictionaries take an extremely long time to load. I've therefore limited the dictionaries to 100 entries under Linux for the time being, until the Moonlight bug is fixed (a good reason to stick with InScribe Web Preview 2). I'm planning to track the Moonlight 3 pre-release versions towards release, updating Preview 3 if necessary and feasible.

All being well, Moonlight 3 will be released by Novell by Autumn with full Silverlight 3 compatibility, so I can retire InScribe Web Preview 2, leaving Preview 3 as a fully cross-platform solution for Windows/Mac/Linux.