Notes on Character Sets, Unix, and Anzio
April 3, 1996
Effective Anzio 10.8

INTRODUCTION

For most users outside the US, character sets become an issue. The ASCII character set, which fits nicely into 7 bits, includes most characters that US users will need, and is widely standardized. Other countries and other languages require different characters, such as A-umlaut, C-cedilla, and n-tilde. Unfortunately, there are several "standards" by which these can be stored, and this leads to a great deal of confusion.

CHARACTER SETS ON THE HOST SYSTEM

On the host system, such as Unix, special characters not in the ASCII set can be stored in several ways:

1) ISO 8859-1, an 8-bit set very similar to that used in Windows.

2) IBM PC code page 437, the primary 8-bit set used on the PC.

3) IBM PC code page 850, also 8-bit, used on some PCs in Europe and elsewhere.

4) Other 8-bit or 16-bit code sets, such as those used for Japanese or Cyrillic (Russian).

5) Various National Replacement Character (NRC) sets. These are 7-bit sets, where certain ASCII characters that receive less use are replaced by required special characters. This approach, although cumbersome, allows operation over a 7-bit data channel. As an example, the German NRC replaces the left-square-bracket, hex 5B, with the capital-A-umlaut. Note that this can have an impact on VT-style cursor-control sequences, which begin with ESC followed by a left-square-bracket.

CHARACTER SETS ON THE PC

A PC running in character mode uses a ROM chip to generate the dots that make up a character. This usually supports code page 437 or 850. Alternatively, DOS can use a translation layer to change from one code page to another.

A PC running Windows will generally work in the standard Windows character set, which is essentially the same as ISO 8859-1. Keystrokes are generated in this set. Some screen and printer fonts, however, are in other character sets, notably the "Terminal" font, which is in the "OEM" set, that is, code page 437.
Still other fonts may contain a special character set such as Cyrillic.

TRANSLATION IN A TERMINAL EMULATOR

There are two approaches to translating characters from one character set to another: 1) translate on the host, and 2) translate in the emulator.

When the host system is Unix, one utility for host-based translation is "mapchan". A translation file can be attached to each terminal session, and it can translate data moving in both directions.

A terminal or emulator can also be set for a particular character set. In other words, the terminal (emulator) can do the translation if necessary.

Notice that if 8-bit data is being passed between terminal and host, the communication channel must be 8 bits wide. This requires turning off any parity handling, with a command like:

    stty -parenb cs8 -istrip

THE STATUS OF ANZIO

Anzio (up through 10.7) displays everything in the code page 437 set. Windows versions of Anzio do this by using the Terminal font (which also contains the line-drawing set used by many host programs). Keystrokes generated by Windows are in the Windows (ISO) character set. DOS versions of Anzio receive CP 437 characters from the keyboard.

EFFECTIVE 10.8, Windows versions of Anzio allow you to choose a screen font (the default is "Terminal"). This must be a fixed-space font, so only fixed-space fonts are listed in the dialog box. However, the font may be in the OEM character set (e.g., Terminal or MS Linedraw), in the Windows character set (e.g., Courier New), or in a special character set such as Cyrillic. It can be either a bitmap font or a TrueType font. Bitmap fonts tend to work better for smaller character sizes; TrueType fonts work better for larger sizes.

Only the OEM character set has the line-drawing characters which are often sent from the host. If you select a font such as Courier New, Anzio will have to use "+", "-", and "|" to create boxes.
However, there may be an advantage to using a font such as this, either for appearance or because certain characters (for example, Icelandic o-thorn) don't exist in the OEM set.

When Anzio is set to emulate VT220 or Anzio, it assumes that characters coming in from the host are in the ISO set. If Anzio is set for SCO ANSI or AT386, it assumes the characters are in CP 437. Anzio then looks at the screen font, and performs any needed translation. Other terminal types generally do not support 8-bit characters.

Anzio can also deal with NRCs. These can be selected from a menu item, or switched from the host using VT220-compatible sequences.

Anzio currently has NO means of dealing with 16-bit sets, such as Japanese.
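
ILLUSTRATION: TRANSLATING BETWEEN CHARACTER SETS

The translation described in these notes can be sketched in a few lines of Python. This is only an illustration of the idea, not Anzio's actual code. The function names (fallback, translate_for_font, decode_german_nrc) are hypothetical, and Python's built-in "cp437" and "latin-1" codecs stand in for the OEM and ISO 8859-1 sets. In the German NRC table, only the hex 5B entry comes from these notes; the other entries are recalled from the DEC German set and should be checked against the terminal documentation before being relied upon.

```python
# Illustrative sketch only; not Anzio's actual translation code.

# German NRC: 7-bit codes whose ASCII glyphs are replaced.
# Only 0x5B is taken from the notes; the rest are assumed.
GERMAN_NRC = {
    0x40: "\u00a7",  # @ -> section sign
    0x5B: "\u00c4",  # [ -> capital A-umlaut
    0x5C: "\u00d6",  # \ -> capital O-umlaut
    0x5D: "\u00dc",  # ] -> capital U-umlaut
    0x7B: "\u00e4",  # { -> small a-umlaut
    0x7C: "\u00f6",  # | -> small o-umlaut
    0x7D: "\u00fc",  # } -> small u-umlaut
    0x7E: "\u00df",  # ~ -> sharp s
}


def fallback(ch: str) -> str:
    """Pick an ASCII stand-in for a character the font cannot show."""
    if 0x2500 <= ord(ch) <= 0x257F:      # Unicode box-drawing block
        if ch in "\u2500\u2550":         # single/double horizontal line
            return "-"
        if ch in "\u2502\u2551":         # single/double vertical line
            return "|"
        return "+"                       # corners, tees, crossings
    return "?"                           # no reasonable substitute


def translate_for_font(data: bytes, host_charset: str,
                       font_charset: str) -> bytes:
    """Re-encode host bytes into the character set of the screen font."""
    out = bytearray()
    for ch in data.decode(host_charset):
        try:
            out += ch.encode(font_charset)
        except UnicodeEncodeError:
            out += fallback(ch).encode("ascii")
    return bytes(out)


def decode_german_nrc(data: bytes) -> str:
    """Interpret 7-bit host data using the German NRC table above."""
    return "".join(GERMAN_NRC.get(b & 0x7F, chr(b & 0x7F)) for b in data)


# A-umlaut is hex 8E in CP 437 but hex C4 in ISO 8859-1.
print(translate_for_font(b"\x8e", "cp437", "latin-1"))
# A box top edge sent in CP 437, shown in an ISO (Windows) font.
print(translate_for_font(b"\xda\xc4\xbf", "cp437", "latin-1"))
# Under the German NRC, hex 5B is A-umlaut rather than "[".
print(decode_german_nrc(b"[") == "\u00c4")
```

Note that hex C4 means A-umlaut in ISO 8859-1 but a horizontal line-drawing character in CP 437, which is why the emulator has to know both the host's character set and the font's character set before it can display a byte correctly.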