Wednesday 21 April 2010

Introduction to the UMdC file format for Ancient Egyptian

UMdC (Unicode Manuel de Codage) is a new file format for documents containing Ancient Egyptian. This informal note is aimed at people familiar with versions of the ‘Manuel de Codage’ (MdC) protocol as used in applications such as InScribe, JSesh, MacScribe and WinGlyph. My objective here is to explain a little about what UMdC is, why I’ve devised this new format and how I envisage it being used. The good news is that UMdC is highly compatible with MdC, so there is little in the way of a learning curve and no need to ditch existing software tools and methodologies entirely.

Manuel de Codage
A scheme for representing Ancient Egyptian was published in 1988 as Manual for the Encoding of Hieroglyphic Texts for Computer-input (Jan Buurman, Nicolas Grimal, Jochen Hallof, Michael Hainsworth and Dirk van der Plas, Informatique et Egyptologie 2, Paris 1988). This is generally known as Manuel de Codage or simply MdC. It is useful to refer to the original scheme as MdC88.

MdC88 was never a formal specification and has been interpreted and extended in several ways for use by applications that work with Ancient Egyptian. There is no ‘standard’ MdC, only dialects. In many simple cases this is not a problem: everybody agrees what ‘+sO34-N37:Y1-’ represents as hieroglyphs. However, for more complex texts there is scope for ambiguity, confusion and incompatibility.

UMdC basics
Here is a list of some UMdC characteristics. My goal has been to keep things as simple as possible and to avoid scenarios which may be useful for some purposes but are not, in my opinion, appropriate to address in an MdC-like approach. This list is not a specification; I simply want to give a flavour of what is involved.
  1. UMdC files must use the ‘.umdc’ file extension (i.e. umdc file type) except in special circumstances. MdC88 did not define rules for file names so several alternatives are in use.
  2. UMdC files must use UTF-8 (the 8-bit Unicode encoding). Unicode allows most modern languages and scripts to be written, from English to Kanji to Arabic and Hebrew. MdC88 specified ASCII, so even the accented characters popular in some European languages are not available. Dialects of MdC often use the ISO-8859-1 (Latin-1 Western European) 8-bit encoding or similar, but more by accident than design, and there is a lot of scope for confusion.
  3. UMdC files begin with the 8 characters ++++UMdC so that software knows this is really meant to be a UMdC file and can proceed accordingly. MdC88-compatible software will interpret this sequence as a comment. It is permissible to precede this sequence with the Unicode BOM (byte order mark), since some text editors such as Windows Notepad add the BOM and it would be confusing not to accept it, although doing so may throw some software!
  4. UMdC follows MdC88 in stating that all text content is preceded by +l (normal text), +b (normal, bold text), +i (normal, italic text), +t (transliteration), +c (Coptic), +g (Greek), +s (hieroglyphs) or ++ (comment). The ! and !! conventions for end of line and end of page are used. This means UMdC is very compatible with MdC at one level. The rules here are, however, more tightly defined, as will be detailed in the specifications.
  5. UMdC adds the notion of a umdc-instruction. All umdc-instructions begin with the three characters +++. The beauty of this approach is that MdC88-compatible software interprets a umdc-instruction as a comment so, although the information may go unused, hieroglyph segments etc. survive unchanged. UMdC uses umdc-instructions for most new functionality, such as rich text formatting options.
  6. UMdC version 1.0 requires that Gardiner codes and mnemonics are based on the EGPZ 1.0 specifications. MdC88 defined Gardiner codes but this set was superseded, so although everyone agrees what ‘A1’ and ‘n’ mean, the same is not true beyond the common Egyptian Grammar set.
  7. Applications can elect to use application-specific umdc-instructions to implement features such as special hieroglyph layout options or non-EGPZ coding conventions. This is not encouraged except where unavoidable and there are rules.
  8. UMdC itself cannot be extended by an application provider, only through an official change to the specification. A non-complying UMdC file counts as an error pure and simple; there are no ‘dialects’. Rules govern future official extensions to avoid breaking software written to the current specification.
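
As a rough illustration of points 4 and 5, the following Python sketch splits a string into segments at the switches listed above. It is not taken from any UMdC specification. It assumes that a segment simply runs until the next switch, it ignores any escaping rules and the ! and !! markers, and it discards anything before the first switch. The function name split_segments and the instruction name example-instruction are hypothetical. Passing umdc=False crudely models an MdC88 reader by treating +++ as ++ followed by a literal + inside the comment.

```python
import re

# Switches listed in the note: the MdC88 conventions plus the UMdC '+++'.
SEGMENT_KINDS = {
    "+l": "text",
    "+b": "bold",
    "+i": "italic",
    "+t": "transliteration",
    "+c": "coptic",
    "+g": "greek",
    "+s": "hieroglyphs",
    "++": "comment",
    "+++": "umdc-instruction",
}

# Longest alternative first, so that '+++' is tried before '++'.
_SWITCH = re.compile(r"\+\+\+|\+\+|\+[lbitcgs]")


def split_segments(source, umdc=True):
    """Return (kind, content) pairs, one per switch found in source.

    Assumption: each segment runs up to the next switch or the end of input.
    """
    segments = []
    matches = list(_SWITCH.finditer(source))
    for index, match in enumerate(matches):
        switch = match.group()
        start = match.end()
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(source)
        if switch == "+++" and not umdc:
            # Simplified MdC88 view: '++' opens a comment and the third '+'
            # becomes the first character of the comment text.
            switch = "++"
            start -= 1
        segments.append((SEGMENT_KINDS[switch], source[start:end]))
    return segments


# Hypothetical instruction name, used for illustration only.
sample = "+++example-instruction+sV9:W24-O49-!"
print(split_segments(sample))
print(split_segments(sample, umdc=False))
```

In both readings the hieroglyph segment V9:W24-O49-! comes out unchanged; only the classification of the +++ segment differs. A real MdC88 reader may behave differently, for example if the character after +++ happens to be one of the switch letters.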
In short, the UMdC file containing

++++UMdC+lHello +sV9:W24-O49-!

corresponds to the MdC88 text

+lHello +sV9:W24-O49-!
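
The correspondence above, together with points 2 and 3, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not a reference implementation: the function name read_umdc is hypothetical, and treating the text after the signature as the MdC88 equivalent is only claimed for simple files like this one.

```python
import codecs

# The 8-character signature described in point 3.
SIGNATURE = b"++++UMdC"


def read_umdc(data):
    """Check raw bytes for the UMdC signature and return the text after it.

    Assumptions: an optional UTF-8 BOM may precede the signature, and the
    content must decode as strict UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Accept the BOM that editors such as Windows Notepad add.
        data = data[len(codecs.BOM_UTF8):]
    if not data.startswith(SIGNATURE):
        raise ValueError("not a UMdC file: missing ++++UMdC signature")
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
    return data[len(SIGNATURE):].decode("utf-8")


raw = b"++++UMdC+lHello +sV9:W24-O49-!"
print(read_umdc(raw))
print(read_umdc(codecs.BOM_UTF8 + raw))
```

Both calls print +lHello +sV9:W24-O49-!, the MdC88 text shown above. Input without the signature, or with bytes that are not valid UTF-8, raises an error rather than being guessed at, in keeping with point 8.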

UMdC Development Roadmap
I am working on UMdC documentation (to go on http://www.egpz.com/). This consists of user-oriented material, a technical reference, and implementation guidance for software writers and others who want to support UMdC.

I am also working on a complete implementation of a document editor for umdc files to be included in InScribeX Web (Preview 4) (to go on http://www.inscribex.com/).

UMdC support is also being included in a second edition of the InScribe 2004 software, namely InScribe 2004SE. As part of this, the ‘.InScribe’ file format is being adapted to be 100% UMdC compatible. Support for ‘.InScribe’ and ‘.umdc’ file integration into Windows search is another useful feature.

My aim is to get most if not all of this work completed in advance of the Informatique et Égyptologie 2010 meeting, to be held in early July in Liège. The project is unfunded so we shall see!
