Tuesday 26 October 2010

Egyptian Hieroglyphs on the Web (October 2010)

One year after the release of Egyptian Hieroglyphs in Unicode 5.2 there has been some progress in making hieroglyphs usable on the web although it is still early days. I hope these notes are useful.

If you can see hieroglyphs 𓄞𓀁 in this sentence, good. Otherwise, here are a few notes so you can decide whether it might be better to wait until things have moved forward a little.

Information on Egyptian Hieroglyphs in Unicode

For information on Unicode 6.0 (the latest version) see www.unicode.org/versions/Unicode6.0.0/. While the full text for 6.0 is being updated, refer to the 5.2 version www.unicode.org/versions/Unicode5.2.0/ch14.pdf section 14.17. The direct link to the chart is www.unicode.org/charts/PDF/U13000.pdf where signs are shown using the InScribe font.

The Wikipedia article en.wikipedia.org/wiki/Egyptian_hieroglyphs is fairly accurate as far as it goes, and contains hieroglyphs in Unicode which can be viewed given a suitable browser and font.

InScribeX Web still contains the largest set of material viewable online including sign list, dictionaries and tools. You need a Silverlight (or Moonlight) compatible system (the vast majority of PCs, whether Linux, Mac or Windows, are fine). There is no requirement to install a font. I last updated InScribeX Web in May – yes, it is about due for an update but time is the enemy (and I’d like to see Moonlight 3 released first anyway).

Browsers

Of the popular web browsers, only recent versions of Firefox display individual Unicode hieroglyphs correctly. I expect the situation will change over the next few months. Meanwhile, use Firefox if you want to explore the current state of the art.

Search

Right now, only Google search indexes Unicode hieroglyphs (and the transliteration characters introduced at Unicode 5.1 in 2008). I expect at some point next year Bing and Yahoo will be brought up to date but meanwhile stick with Google.

Fonts

A satisfactory treatment of hieroglyphs on the web really needs smart fonts installed on your computer. I’m on the case (see Simplified Egyptian: A brief Introduction) but it will take some time until all the pieces of the puzzle including browser support come together (see ISO/Unicode scripts missing in OpenType and other comments here).

Neither Apple nor Microsoft provides a suitable font at the moment as part of, or as an add-on to, iOS, OSX, or Windows.

Meanwhile, in general I can’t advise about basic free fonts to use (fonts sometimes appear on the internet without permission of copyright holders and I don't want to encourage unfair use of creative work).

I will note an ‘Aegyptus’ font is downloadable at http://users.teilar.gr/~g1951d/ – the glyphs are apparently copied from Hieroglyphica 2000. I’ve not analyzed this yet.

For InScribe 2004 users I currently have an intermediate version of the InScribe font available on request. Email me on Saqqara at [Saqqara.org] with ‘InScribe Unicode Font’ in the message title (I get a lot of spam); that way I can also let you know about updates.

Asking a user to install a font to read a web page is in general a non-starter; I think the medium term future for web sites is the Web Open Font Format (WOFF) once the dust settles on the new web browser versions in development. I’ll post here about the InScribe font in this context and make examples available when the time is ripe.

Friday 22 October 2010

Of Characters and Strings (in .Net, C#, Silverlight …): Part 2

"Your manuscript I’ve read my friend
And like the half you’ve pilfered best;
Be sure the piece you yet may mend –
Take courage man and steal the rest."

Anon (c. 1805)

As mentioned in Of Characters and Strings (in .Net, C#, Silverlight …): Part 1, many of the text processing features of .Net are such that code can be re-used over Silverlight, WPF, XNA, Mono, Moonlight etc. In this note I want to draw attention to a few key points and differences over platforms. It is far from a comprehensive analysis and I’ve skipped some important topics including collation, the variety of ways of displaying text, and input methods.

A quick aside: I failed to highlight in Part 1 that 16-bit confusion in software is not limited to .Net; similar issues arise in Java (although Java has some helpful UTF-32 methods that .Net lacks) and the situation is far worse in the world of C/C++.

I’ll restrict attention to Silverlight 4, Silverlight 3 for Windows Phone 7 OS (WPOS7.0), and .Net 4. In case the significance of the opening quote passes the reader by, it is to be hoped that Silverlight 5 and any new branches on the .Net tree feature increased compatibility, deprecating certain .Net 4 elements where appropriate.

.Net Text in general

Many Silverlight (and WPF) programmers rely mainly on controls and services written by others to deal with text so are unfamiliar with the lower level implications – in this case the main thing is to make sure you test input, processing, and display components with text that contains non-BMP characters, combining character sequences, and a variety of scripts and fonts. You may also like to check out different culture settings. Relying on typing qwerty123 on a UK/US keyboard and using the default font is a sure-fire recipe for bugs in all but the simplest scenarios.

.Net text is oriented around UTF-16 so you will be using Unicode all or almost all the time. The dev. tools, e.g. Visual Studio 2010 and Expression Blend 4, work pretty well with Unicode and cope with the basics of non-BMP characters. I find the main nuisance with the VS 2010 code editor is that only one font is used at a given time; I can’t currently combine the attractive Consolas font with fall-back to other fonts when a character is undefined in Consolas.

For portability over .Net implementations, the best bet at present is to write the first cut in Silverlight 3 since for this topic SL3 is not far off being a subset of .Net 4. A good design will separate the text functionality support from the platform-specific elements, e.g. it’s not usually a good idea to mix text processing implementation in with the VM in a MVVM design or in the guts of a custom control. I often factor out my text processing into an assembly containing only portable code.

Text display on the various platforms is a large topic in its own right with a number of issues for portability once you go beyond using standard controls in SL, WPF etc. I’ll not go into this here. Likewise with text input methods.

Versions of Unicode

It would be really useful if the various flavours and versions of .Net were assertive on text functionality, e.g. Supports Unicode 6.0 (the latest version) or whatever. By support, I mean the relevant Unicode data tables are used (e.g. scripts, character info) not to imply that all or even most functionality is available for all related languages, scripts and cultures. Currently, the reality is much more fuzzy and there are issues derived from factors such as ISO/Unicode scripts missing in OpenType. Methods like Char.IsLetter can be useful in constructing a 'version detector' to figure out what is actually available at runtime.
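A minimal sketch of such a 'version detector' follows. The class and method names are hypothetical, and the choice of probe characters is my own assumption: each is a letter first encoded in the stated Unicode version. A runtime may carry partial or patched tables, so treat the result as a hint rather than a guarantee.

```csharp
using System;

// Hypothetical helper: guesses which Unicode version the runtime's
// character tables reflect by probing letters added in known versions.
public static class UnicodeVersionProbe
{
    // Newest first. Each entry is { version, probe string }.
    private static readonly string[][] Probes =
    {
        new[] { "6.0", "\u0840" },      // MANDAIC LETTER HALQA
        new[] { "5.2", "\U00013000" },  // EGYPTIAN HIEROGLYPH A001
        new[] { "5.1", "\uA640" },      // CYRILLIC CAPITAL LETTER ZEMLYA
        new[] { "5.0", "\U00010900" },  // PHOENICIAN LETTER ALF
    };

    // Returns the newest probed version whose letter is recognised,
    // or null if none of the probes is reported as a letter.
    public static string NewestRecognisedVersion()
    {
        foreach (string[] probe in Probes)
        {
            // The (String, Int32) overload handles surrogate pairs.
            if (Char.IsLetter(probe[1], 0))
            {
                return probe[0];
            }
        }
        return null;
    }
}
```

The probes are written as string literals rather than Char values so that the non-BMP entries can be expressed at all.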

Char structure

The .Net Char structure is a UTF-16 code unit, not a Unicode character. Current MSDN documentation still needs further updates by Microsoft to avoid this confusion (as noted in Part 1). Meanwhile the distinction should be kept in mind by developers using Char. For instance although the method Char.IsLetter(Char) returns true if the code unit in question is a letter-type character in the BMP, IsLetter(Char) is not in general a function you would use when looking for letters in a UTF-16 array.

It is therefore often a good idea to use strings or ‘string literal’ to represent characters, in preference to Char and ‘character literal’. Inexperienced programmers or those coming from a C/C++ background may find this odd to begin with, being familiar with patterns like

for (i = 0; i < chars.Length; i++) { if (chars[i] == 'A') { etc. } }

Fortunately, Char provides methods to work correctly with characters, for instance Char.IsLetter(String, Int32). Char.IsLetter("A",0) returns true and the function works equally well for Char.IsLetter("\U00010900",0) so long as your platform supports Unicode 5.0 (the version of Unicode that introduced U+10900 𐤀 PHOENICIAN LETTER ALF). Note that the C# escape \u takes exactly four hexadecimal digits, so a non-BMP character needs the eight-digit \U form.

Char is largely portable apart from a few quirks. I find it especially puzzling that Char.ConvertFromUtf32 and Char.ConvertToUtf32 are missing from SL 3/4. Faced with this I wrote my own conversions and use these even on .Net 4.0 where the methods are available in Char, this way keeping my code portable and efficient.
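For illustration, here is a minimal sketch of what such hand-written conversions might look like. It is not the code I actually ship; the class and method names are hypothetical and the error handling is deliberately strict (unpaired surrogates are rejected).

```csharp
using System;

// Hypothetical portable stand-ins for Char.ConvertToUtf32 / Char.ConvertFromUtf32.
public static class Utf32Portable
{
    // Returns the code point that starts at s[index], combining a
    // surrogate pair into a single value where one is present.
    public static int ToCodePoint(string s, int index)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (index < 0 || index >= s.Length) throw new ArgumentOutOfRangeException("index");

        char hi = s[index];
        if (hi >= 0xD800 && hi <= 0xDBFF)
        {
            if (index + 1 < s.Length)
            {
                char lo = s[index + 1];
                if (lo >= 0xDC00 && lo <= 0xDFFF)
                {
                    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
                }
            }
            throw new ArgumentException("Unpaired high surrogate.", "s");
        }
        if (hi >= 0xDC00 && hi <= 0xDFFF)
        {
            throw new ArgumentException("Unpaired low surrogate.", "s");
        }
        return hi; // BMP character: the code unit is the code point
    }

    // Returns the UTF-16 string (one or two code units) for a code point.
    public static string FromCodePoint(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            throw new ArgumentOutOfRangeException("codePoint");
        }
        if (codePoint < 0x10000)
        {
            return new string((char)codePoint, 1);
        }
        int v = codePoint - 0x10000;
        char hi = (char)(0xD800 + (v >> 10));
        char lo = (char)(0xDC00 + (v & 0x3FF));
        return new string(new[] { hi, lo });
    }
}
```

For example, Utf32Portable.ToCodePoint("\U00013000", 0) gives 0x13000, and Utf32Portable.FromCodePoint(0x13000) gives the two code units 0xD80C, 0xDC00.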

I also wrote a CharU32 structure for use in UTF-32 processing (along with a StringU32) rather than extend Char to contain methods like IsLetter(int) for UTF-32 (as is done in Java). Makes for cleaner algorithms and something Microsoft may want to consider for .Net futures.

The CharUnicodeInfo class in SL 3/4 is a subset: it is missing GetDecimalDigitValue, a rather specialist helper for working with decimal numbers in non-Latin scripts. In most cases one would use GetNumericValue, which is portable.

CharUnicodeInfo.GetUnicodeCategory and Char.GetUnicodeCategory have subtle differences.

String class

The .Net String is a sequential collection of Char objects whose value is immutable (i.e. its value is never modified but can be replaced). Informally, String encapsulates a read-only UTF-16 array. As with Char, the MSDN documentation tends to confuse Unicode character with UTF-16 code unit.

Comments above on the use of string and string literal in place of Char also apply here. For instance the method IndexOf(Char) can only be used to find a Char, i.e. a code unit. IndexOf(String) must be used to find an arbitrary Unicode character. If you try entering '\U00010900' in C# in Visual Studio 2010, you will be warned “Too many characters in character literal”, a reminder that .NET character literals are not quite characters and that the string literal "\U00010900" is needed.

Much of String is portable. I’ll just mention a few differences:

.Net 4.0 has extra String methods like StartsWith(String, Boolean, CultureInfo), a method which gives more flexibility in working with multiple cultures. SL 3/4 is oriented around current culture and invariant culture so is not well suited to multilingual applications.

The whole model of culture and locale in software is way too complex to go into here; I’ll just say that old fallacies, such as the idea that physical presence in Iceland means a user can speak Icelandic and never spends US dollars, or that individuals use only one language, have no place in the 21st Century.

SL 3/4 is missing the Normalize method, a very handy function when working with Unicode.

SL 3/4 has fewer constructors available to the programmer than .Net 4 but some of those missing are error prone and wouldn’t normally be used in .Net 4.
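To show why the Normalize method mentioned above is so handy, here is a small sketch for .Net 4 (it will not compile on SL 3/4, where Normalize is absent). The helper name is hypothetical. It compares two strings after converting both to Normalization Form C.

```csharp
using System;
using System.Text;

// Hypothetical helper, .Net 4 only: String.Normalize is not available in SL 3/4.
public static class NormalizeDemo
{
    // True if the two strings are the same sequence of code units
    // once both have been converted to Normalization Form C.
    public static bool SameAfterNfc(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

For example, "\u00E9" (precomposed é) and "e\u0301" (e followed by a combining acute accent) differ under an ordinal comparison, but SameAfterNfc reports them as equal.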

StringBuilder class

Useful: gives measurable performance improvements when constructing a lot of strings.

Very portable, with just a few differences that I’ve not encountered in real-life code yet. It is rather odd that Append(Decimal) is missing from SL 3/4 though.

StringInfo class

StringInfo is designed to work with actual Unicode characters so is very useful. I get the impression it is less well known among programmers than it ought to be.

Largely portable. However SL 3/4 is missing the SubstringByTextElements methods and LengthInTextElements both of which are handy for processing Unicode correctly in readable code. I’ve found myself simulating something similar to gain portability, although that being said, the portable enumeration methods make for a more common pattern.
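As a sketch of the enumeration pattern, the following splits a string into text elements using StringInfo.GetTextElementEnumerator. The wrapper class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical wrapper around the text element enumerator.
public static class TextElements
{
    // Splits a string into text elements, so that a surrogate pair, or a
    // base character plus its combining marks, comes back as one item.
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```

For example, "m\U00013000e\u0301" has five UTF-16 code units but splits into three text elements: m, the hieroglyph 𓀀, and e with its combining acute accent.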

Encoding classes

Encoding classes are often used while reading or writing text streams or converting from one text format to another, usually with a class derived from Encoding: SL 3/4 only provides the UnicodeEncoding (UTF-16) and UTF8Encoding classes, in contrast to .Net 4 which also has ASCIIEncoding, UTF7Encoding, and UTF32Encoding.

Omitting ASCII and UTF-7 is understandable, but the omission of UTF32Encoding is hard to see as anything other than an oversight given that UTF-32 is a useful weapon in the armoury of Unicode programming (although hardly ever a good idea for file streams). One of the first things I did in the early days of Silverlight 2 beta was to write conversion code. I hope Microsoft will add this to Silverlight 5, although for portability reasons it may be years before we can settle down with it.

The base Encoding class in .Net 4 has quite a number of extra methods and properties for Mail and Browser purposes as well as codepage handling. I tend to agree with the decision in SL to drop some of this complexity and discourage use of legacy stream formats even if that means occasionally custom code is required to convert ‘Windows ANSI’, ‘Mac OS Roman’ and the ISO/IEC 8859 series of 8 bit encodings.

The main features of UTF8Encoding such as BOM handling and exceptions are portable. As with the String class, Unicode normalization features are missing from SL 3/4.

Not the most portable part of .Net.

Tuesday 19 October 2010

Of Characters and Strings (in .Net, C#, Silverlight …): Part 1

“The time has come,” the Walrus said,
“To talk of many things:
Of shoes—and ships—and sealing-wax—
Of characters—and strings—
And why the sea# is boiling hot—
And whether pigs have wings.”

(With apologies to Lewis Carroll, the Walrus, and the Carpenter).

During discussion of my comments ISO/Unicode scripts missing in OpenType on the Unicode mailing list, the point came up about desirability of greater understanding of Unicode among programmers and others involved with software development. For a start, there is one popular myth to dispel, the subject of this post which I hope to be the first of several notes on Unicode in .Net.

Myth debunk: a Unicode character is neither a cabbage nor a 16 bit code.

The origin of 16-bit confusion lies in the history of Unicode. Twenty years ago there were two initiatives underway to replace the already out-dated and problematic variety of 7/8-bit character encodings used to represent characters in modern scripts. A true Babel of ‘standard’ encodings back then made it impractical to write software to work with the world’s writing systems without a tremendous level of complexity. Unicode was originally conceived as a 16 bit coding to replace this mess. Meanwhile, the International Organization for Standardization (ISO) was working on ISO 10646, the ‘Universal Character Set’ (UCS), with space for many more characters than a 16-bit encoding has room for. The original ISO proposals for encoding were widely regarded as over-complex so the ISO/Unicode approaches were merged by the time Unicode 2.0 was released in 1996. ISO 10646 now defines the Universal Character Set for Unicode. With unification, the notion of 16-bit characters became obsolete although a 16-bit encoding method remains (UTF-16) along with the popular 8-bit coding (UTF-8) and a 32-bit coding (UTF-32). Each encoding has its virtues. UTF stands for Unicode Transformation Format.

To understand what constitutes the Unicode notion of ‘character’, refer to http://www.unicode.org/versions/Unicode6.0.0/ (or the earlier version while the text of 6.0 is being completed). I will try to summarize briefly.

1. An abstract character is a unit of information for representation, control or organization of textual data. A Unicode abstract character is an abstract character encoded by the Unicode standard. Abstract characters not directly encoded in Unicode may well be capable of being represented by a Unicode combining character sequence. Each Unicode abstract character is assigned a unique name. Some combining sequences are also given names in Unicode, asserting their function as abstract characters.
2. A Unicode encoded character can be informally thought of as an abstract character along with its assigned Unicode code point (an integer in the range 0 to 10FFFF hexadecimal, the Unicode codespace). As noted above it is also assigned a unique name.
3. A Unicode character or simply character is normally used as shorthand for the term Unicode encoded character.

Here are two useful ways of describing Unicode characters:

U+006D LATIN SMALL LETTER M
U+13000 EGYPTIAN HIEROGLYPH A001
U+1F61C FACE WITH STUCK-OUT TONGUE AND WINKING EYE

And similar with the actual character displayed

U+006D – m – LATIN SMALL LETTER M
U+13000 – 𓀀 – EGYPTIAN HIEROGLYPH A001
U+1F61C – 😜 – FACE WITH STUCK-OUT TONGUE AND WINKING EYE

The first form is often preferable in scenarios where font support might not be present to display the actual character although on this blog I prefer to use the characters to encourage font diversity.

Note the conventional use of hexadecimal to state the value of the Unicode code point. This differs from the common practice in HTML, where numeric character references are usually written in decimal, e.g. &#77824; (13000 hexadecimal equals 77824 decimal), although HTML also accepts the hexadecimal form &#x13000;.
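The notations can be produced mechanically from the code point. The following is a small sketch with hypothetical class and method names; it also shows the hexadecimal reference form that HTML accepts alongside the decimal one.

```csharp
using System.Globalization;

// Hypothetical formatting helpers for a Unicode code point.
public static class CodePointNotation
{
    // Conventional Unicode notation: "U+" and at least four hex digits.
    public static string UPlus(int codePoint)
    {
        return "U+" + codePoint.ToString("X4", CultureInfo.InvariantCulture);
    }

    // HTML decimal numeric character reference.
    public static string DecimalReference(int codePoint)
    {
        return "&#" + codePoint.ToString(CultureInfo.InvariantCulture) + ";";
    }

    // HTML hexadecimal numeric character reference.
    public static string HexReference(int codePoint)
    {
        return "&#x" + codePoint.ToString("X", CultureInfo.InvariantCulture) + ";";
    }
}
```

For the code point 0x13000 these give U+13000, &#77824; and &#x13000; respectively, and for 0x6D the first gives U+006D.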

From a programming perspective, the simplest way of representing Unicode is UTF-32 where each code point fits comfortably into a 32 bit data structure, e.g. in C# a uint or int (C/C++ programmers note that C# defines these as 32 bit; the size does not vary with CPU register size). Processing is still not entirely trivial because there may be combining sequences. However UTF-32 is not used all that much in practice, not least because of memory cost.

Nowadays, most files containing Unicode text use UTF-8 encoding. UTF-8 uses 1 byte (octet) to encode each of the 128 traditional ASCII characters and up to 4 bytes to encode other characters. XML and HTML files are popular file formats that use Unicode (mandatory in XML, optional in HTML where a surprising amount of the web, possibly 50%, still uses legacy encodings). I strongly recommend UTF-8 for text files rather than UTF-16 or legacy 8-bit encodings aka code pages etc. Having worked on several multilingual content-intensive projects, this is the golden rule, although I won’t expand further today on the whys and wherefores. [However I ought to mention the catch that is the ‘Byte order mark’, a byte sequence (0xEF, 0xBB, 0xBF) sometimes used at the start of a UTF-8 stream to assert UTF-8 not legacy text; this can confuse the novice particularly with ‘.txt’ files which can be Unicode or legacy. Windows Notepad uses BOM for Unicode text files. Visual Studio 2010 also uses BOM to prefix data in many file types including XML, XAML and C# code.]
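A short sketch to make the byte counts and the BOM concrete. The class and method names are hypothetical; Encoding.UTF8.GetByteCount is the standard .Net call.

```csharp
using System;
using System.Text;

// Hypothetical helpers illustrating UTF-8 sizes and the UTF-8 BOM.
public static class Utf8Demo
{
    // Number of bytes needed to store the string as UTF-8 (no BOM).
    public static int ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // True if the buffer begins with the UTF-8 BOM: 0xEF, 0xBB, 0xBF.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        return data != null && data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }
}
```

ByteCount gives 1 for "m", 2 for "\u00E9" (é), 3 for "\u20AC" (€) and 4 for "\U00013000" (𓀀). A reader of ‘.txt’ files might call StartsWithUtf8Bom on the first bytes of the file and skip three bytes when it returns true.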

UTF-16 is very popular with software writers working in C/C++ and .Net languages such as C#. A version of UTF-16 was the standard data format for Unicode 1.0. Unicode characters with character codes less than 0x10000 are said to belong to the Unicode BMP (Basic Multilingual Plane) and these are represented by one 16 bit number in UTF-16; other characters require two 16 bit numbers, i.e. two UTF-16 codes from a range that does not encode characters, the so-called surrogate code points dedicated to this purpose. As of Unicode 6.0, fewer than 50% of characters belong to the BMP but BMP characters account for a huge proportion of text in practice. This is by design; all popular modern languages have most script/writing system requirements addressed by the BMP and there are even specialist scripts such as Coptic defined here. Processing UTF-16 is often more efficient than UTF-8 and in most cases uses half the memory of UTF-32, all in all a good practical compromise solution.

Which brings me back to the 16-bit myth. The fact that so many popular characters belong to the BMP and only require one code unit in UTF-16 means it is easy to be mistaken into thinking most means all. The problem doesn’t even arise with UTF-8 and UTF-32 but the fact is much software uses UTF-16, indeed UTF-16 is essentially the native text encoding for Windows and .Net.

Example sources of 16-bit confusion:

The article on character sets at http://www.microsoft.com/typography/unicode/cs.htm is brazen:



This article is dated to 1997 but was probably written much earlier. Windows NT 3.1 (1993) was notable as the first computer operating system to use Unicode as its native text encoding and Microsoft deserves credit for this, alongside Apple who also did much to help early uptake of Unicode (but would not have a new operating system until OSX was released in 2001). I’m quoting this as an example of the fact that there are many old documents on the Web, confusing even when from reputable sources. I should mention, in contrast, much of MSDN (and indeed much of the relevant information on Wikipedia) is pretty up to date and reliable although not perfect on this subject.

The definition of the .Net Char structure on MSDN, http://msdn.microsoft.com/en-us/library/system.char.aspx, is much more recent.



Er, no. Char is not a Unicode character. It is a 16 bit Unicode code unit in UTF-16. Actually, this is explained later on in the Char documentation but the headline message is confusing and encourages programmers to use Char inappropriately.

The reasons I chose the Microsoft examples rather than the myriad of other confusing statements on the web are twofold. Firstly I'm focussing on .Net, C# etc. here. Secondly, Microsoft are generally ahead of the game with Unicode compared with other development systems which makes errors stand out more.

Fact is .Net actually works very well for software development with Unicode. The basic classes such as 'String' are Unicode (String is UTF-16) and it is almost true to say it is harder to write legacy than modern.

I had hoped to get a little further on the actual technicalities of working with Unicode characters and avoiding 16-bit pitfalls but time has proved the enemy. Another day.

Just three useful (I hope) points on .Net to conclude.

1. Code that works with String and Char should avoid BMP-thinking, e.g. if you want to parse a String, either avoid tests like IsLetter(Char) or wrap their usage in logic that also handles surrogates.

2. String, Char and the useful StringInfo class belong to the System namespaces and are pretty portable over the gamut of .Net contexts including Silverlight, WPF, XNA as well as the Novell parallel universe with Mono, MonoTouch, Moonlight etc. With a little care it can be straightforward to write text processing code that works across the board to target Windows, Mac, Linux, WP7 and whatever comes next.

3. Always test text-related code with strings that include non-BMP characters, and preferably also with data that includes combining sequences and usage instances of OpenType features such as ligatures.
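To make point 1 concrete, here is a minimal sketch contrasting a surrogate-aware letter count with the BMP-thinking version. The class and method names are hypothetical, and the non-BMP result assumes the runtime's tables know the characters concerned.

```csharp
using System;

// Hypothetical example: counting letters with and without BMP-thinking.
public static class LetterCounter
{
    // Surrogate-aware: tests whole characters and steps over pairs.
    public static int CountLetters(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (Char.IsLetter(s, i)) count++;
            if (Char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
        }
        return count;
    }

    // BMP-thinking: tests each UTF-16 code unit on its own, so the two
    // halves of a surrogate pair are never recognised as a letter.
    public static int CountLettersNaive(string s)
    {
        int count = 0;
        foreach (char c in s)
        {
            if (Char.IsLetter(c)) count++;
        }
        return count;
    }
}
```

With the test string "m\U00013000 1", the surrogate-aware version should count two letters (m and 𓀀) where the naive version counts one. This is exactly the kind of string point 3 suggests keeping in a test suite.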

Wednesday 13 October 2010

Unicode 6.0 released: Let the challenge begin

Unicode 6.0.0 was released yesterday, October 12th 2010. This is a major update to Unicode - version 5.0 was released in 2006, followed by partial updates 5.1 (2008) and 5.2 (2009). Details are given at http://unicode.org/versions/Unicode6.0.0/, also see the Unicode, Inc. press release Unicode 6.0: Support for Popular Symbols in Asia. The Unicode character repertoire reflects the ISO/IEC 10646:2010 standard for characters, Unicode itself adding much of the technical information needed for implementation of writing systems.

All of which gobbledygook masks the fact that Unicode is a rather wonderful thing, not only a valuable technology but also a work of art and beauty that eclipses much that passes for establishment and celebrity art of modern times in my personal opinion. Our world continues a path towards English as the lingua franca for Planet Earth with a clear decline in the relevance of traditional languages. Yet at the same time the technology that may be seen by some as a threat can also be the saviour. Unicode is the keystone. It is a marvellous fact that 5000 years of diverse writing systems can become accessible to all for the first time in history and Unicode has played a pivotal role in making this happen during its 20 year evolution.

A specification is only a starting point. The complete text of Unicode 6.0 is still being revised for publication next year. I recently drew attention to ISO/Unicode scripts missing in OpenType and have since been informed that work is now underway to catch up on the missing scripts. Nevertheless it can be expected that it will take months and years before computer software and digital content catches up.

A fun addition to Unicode is the set of ‘Popular Symbols in Asia’ mentioned above. Emoticons. Here are four examples:

U+1F601 – 😁 – GRINNING FACE WITH SMILING EYES
U+1F63B – 😻 – SMILING CAT FACE WITH HEART-SHAPED EYES
U+1F64A – 🙊 – SPEAK NO EVIL MONKEY
U+1F64C – 🙌 – PERSON RAISING BOTH HANDS IN CELEBRATION

I suspect Emoticons will be the popular motivator for timely support of Unicode 6.0 by the usual corporate suspects (Apple, Google, Microsoft etc.). Meanwhile expect your web browser to show ‘unknown character’ for GRINNING FACE WITH SMILING EYES etc. above.

Search engines. When I checked this morning, neither Bing nor Google search were indexing the Emoticons, I’ll keep my eyes open to report on who wins that particular race.

Internet browser support. Internet Explorer is currently the most popular Internet browser and version 9 is currently in Beta. The standards based approach of IE9 and the promise of improved compatibility among new releases of all browsers through HTML5 support etc. is a very positive direction for the web. Firefox is the second most popular. Then Safari and Chrome, both WebKit based. The level of Unicode 6.0 support in the major IE9 and Firefox 4 releases (expected in the first half of next year) may serve as one interesting predictor of directions in the Browser wars.

There are no especially strong motivators in the traditional Desktop software arena but the situation is different for newer device formats. Which of Android, Windows Phone 7, or iOS will support Emoticons first? What about eReaders? Silverlight, Flash? Traditionally, support for new versions of Unicode has been slow in coming but it seems like the rules are different now.

Should make for an interesting 2011.

Footnote. Much of the content of Unicode is there because a large number of individuals have freely given their time simply because of the worthwhile nature of the project. (I don’t understand why the big picture has not captured the imagination of the wealthy people of our world.) In particular I’d like to mention Michael Everson (who I worked with on Egyptian Hieroglyphs and Transliteration) who deserves recognition for his many years of effort and a dogged determination to take Unicode beyond the short term requirements of commercial and national interests.

Sunday 10 October 2010

The Real Truth about 42

Today is Sunday, 10/10/10, an appropriate day to reflect on the number 42. I’d better explain for the sake of the sanity of non-mathematicians that binary 101010 is in fact the number 42 in disguise.

An obvious feature of 42 is its prime factorization: 2x3x7. Obvious can be boring so I'll add the more obscure fact that the sum 2x3 + 2x7 + 3x7 = 41, just one less than 42. I don’t know if anyone has named the class of numbers for which the sum of the pairwise products of the prime factors, plus 1, equals the number itself. That sum emerged en route on a visit to Hove so if anyone really needs a name how about a ‘Hove number’? Not an exceptional inspiration, but possibly a brand new observation to add to a large literature on the topic of 42.

More ‘fascinating’ facts about 42 can be found at Wikipedia - 42 (number) where I learned Charles Dodgson, writing as Lewis Carroll, was also fond of 42. Perhaps it’s in the Oxford tap water. Computer programmers may be amused by the fact that the wildcard character ‘*’ has character code 42 in ASCII and Unicode.

Truth is, the number 42 has been regarded as special for (probably) over 5000 years.

Traditionally, Ancient Egypt was divided into administrative districts, usually called ‘nomes’ nowadays (from the Greek word for ‘district’, Νομός; also Egyptian spꜣt/𓈈/etc.). Curiously when I placed ideograms for the 20 nomes of Lower Egypt and 22 nomes of Upper Egypt into the first draft for the (as then) proposed Unicode standard for Egyptian Hieroglyphs, it was only afterwards that 42 clicked ‘not that number AGAIN’. I expect the fame of 42 goes back to the dawn of writing and mathematics itself.

Thoth, a (male) Egyptian deity (Egyptian Ḏḥwty; 𓅝, 𓁟 etc.), was associated with wisdom, magic, writing, mathematics, astronomy and medicine. Maat, a (female) deity (Egyptian Mꜣꜥt; 𓁦, 𓁧 etc.) was associated with truth, equilibrium, justice and order. She represents a fundamental concept in Ancient Egyptian philosophy. In some later traditions which featured male-female pairing between deities, Thoth and Maat were linked together (although rarely in a romantic sense). Both deities are prominent in the judging of the deceased as featured in the ‘Book of the Dead’.

The Papyrus of Ani gives a list of 42 ‘negative confessions’ for the deceased – “I have not committed sin”, “I have not murdered” etc. The ‘Ten Commandments’ of the Old Testament can be thought of as a condensed version. Sometimes referred to as ‘the doctrine of Maat’. 42 associated deities, supervised by Thoth, were assigned to the judgment of the deceased during his or her passage through the underworld.

I can’t resist mentioning that the modern name “Book of the Dead” was invented by Karl Richard Lepsius (the Egyptian rw nw prt m hrw has been more literally translated as the ‘Spells of Coming Forth by Day’ or similar). It can be no more than coincidence that the publication in question, “Das Todtenbuch der Ägypter nach dem hieroglyphischen Papyrus in Turin mit einem Vorworte zum ersten Male Herausgegeben”, was published in 1842. Lepsius was a major and influential figure during the emergence of the modern discipline of Egyptology as well as being responsible for the creation of the first hieroglyphic typeface as implemented by typographer Ferdinand Theinhardt, the “Theinhardt font”.

The ’42 Books of Thoth’ aka ’42 Books of Instructions’ were composed from around the 3rd century BC, supposedly based on earlier traditions. Only fragments remain from this Hermetic text which apparently contained books on philosophy, mathematics, magic, medicine, astronomy etc. A legendary source, highly influential in later traditions of mysticism, alchemy, occultism and magic. The 42 Books have been believed by some to contain the hidden key to the mysteries of immortality and the secrets of the Universe. A fruitful topic I guess for Dan Brown and other writers of fiction.

Trivia. Visiting the South Coast last December, I was amused to discover the return rail-fare from Oxford was £42. Got me thinking how often 42 has cropped up in my life. Coincidence can be good fun. I decided to keep an eye open for incidents involving near neighbours of 42: 40, 41, 43, and 44. A prospect so intriguing and exciting I’m surprised I woke up on the approach to a snow and ice encrusted Hove before the train rattled on its way to Worthing. I can now report the scientifically meaningless result after 10 months ‘research’. Those worthy siblings 40, 41, 43, 44 just don’t cut the mustard compared with their famous colleague. Perhaps it’s just me. Although when my son started at secondary school this September, there was a certain inevitability about his reply when asked in what number classroom his form was based. For a moment I thought he was kidding.

I can't really leave the topic without mentioning the obvious.

The writer most credited for the prominence of 42 in modern times is the late Douglas Adams. In his radio series “The Hitchhiker’s Guide to the Galaxy” (BBC Radio 4, 1978), the “Answer to the Ultimate Question of Life, the Universe, and Everything” is calculated to be 42. The meme exploded. Adams later claimed to have picked 42 pretty much at random.

We will never know whether Adams knew of the antiquity of 42 as a profound and famous number, indeed as the answer to his very own ultimate question. It’s easy to speculate that he must have held some knowledge, at least at some subconscious forgotten level. A remarkable coincidence otherwise, unless 42 is in fact the answer.

Yet not impossible. After all there is something rather cute and appealing about 42. She still looks good for her age. Don’t you think so too?

Saturday 9 October 2010

ISO/Unicode scripts missing in OpenType

Unicode 6.0 release is imminent (see www.unicode.org), a year after the release of Unicode 5.2 (October 2009). Version 6.0 introduces three new scripts: Mandaic, Batak, and Brahmi. There are extensions to other scripts and many other improvements and clarifications.

An aside to anyone involved in HTML5 standardisation. It would be a really good idea if Unicode 6.0 compatibility were specified as part of the formal standard for HTML, and included in conformance testing.

OpenType is the de-facto standard for font technology and as such an essential part of implementing a script. The latest set of script tags (codes) for OpenType is given at www.microsoft.com/typography/otspec/scripttags.htm (document last updated in January 2008 when checked today).

The current ISO-15924 list of script codes is given at www.unicode.org/iso15924/iso15924-codes.html.

Unfortunately, some Unicode scripts are missing from the OpenType script tag list. This is long overdue an update.

The fact that Unicode 5.2 has not been incorporated in OpenType specifications a year after release makes for an unsatisfactory situation. I am writing to those concerned and encourage others to do likewise.

The following 15 Unicode scripts are missing from OpenType:

Avestan (134, Avst, Unicode 5.2)
Bamum (435, Bamu, Unicode 5.2)
Batak (365, Batk, Unicode 6.0)
Brahmi (300, Brah, Unicode 6.0)
Egyptian hieroglyphs (050, Egyp, Unicode 5.2)
Imperial Aramaic (124, Armi, Unicode 5.2)
Kaithi (317, Kthi, Unicode 5.2)
Lisu (Fraser) (399, Lisu, Unicode 5.2)
Mandaic, Mandaean (140, Mand, Unicode 6.0)
Old Turkic, Orkhon Runic (175, Orkh, Unicode 5.2)
Inscriptional Pahlavi (131, Phli, Unicode 5.2)
Inscriptional Parthian (130, Prti, Unicode 5.2)
Samaritan (123, Samr, Unicode 5.2)
Old South Arabian (105, Sarb, Unicode 5.2)
Tai Viet (359, Tavt, Unicode 5.2)

As a footnote. Not available in Unicode yet, but of interest to Egyptology are:

Meroitic Hieroglyphs (100, Mero, formal proposal with WG2)
Meroitic Cursive (101, Merc, formal proposal with WG2)
Egyptian Hieratic (060, Egyh, no formal proposal yet, contact me if you have any ideas)
Egyptian Demotic (070, Egyd, no formal proposal yet, contact me if you have any ideas)
There are also some desirable additions to be made to Egyptian Hieroglyphs (I'd like to see something with ISO/WG2 in 2012 if not before).