TTY consoles + UTF-8 + local character support

Mon, 02 October 2006 00:38

The following describes a working way (the only one I've found) to set up support for the Greek keyboard layout in TTY consoles, complete with accents through dead keys, using a UTF-8 locale. It should work for any language for which you can find a console font and a keymap, on any distribution with kbd installed.

What we want to do is simple and perhaps obvious to most, but it can be tricky for those who have difficulty grasping how console fonts, keymaps, console translation and LOCALEs relate to each other. I've had such difficulties too, so I might be wrong in some places; please correct me if so.

First I'll do my best to explain the theoretical aspects involved. Then I'll describe the few simple steps that I followed to set up Greek support.

Theory

What does having proper console-support for my language involve?

It involves several parts. First it is important to understand there is a difference between being able to display non-ASCII characters in console and being able to type non-ASCII characters through your keyboard. The following systems/concepts are involved in the following ways:

--------------------------------------
- Character Encoding.
--------------------------------------

Or "Character Set", "Codeset", "Codepage" and similar.

A character encoding is just a table that matches human-understood characters with computer-understood representations of those characters. As we know, a computer's only way of describing things is in bits (streams of 1s and 0s). Humans describe things using more artificial constructs like "alphabets", tied together with language rules. Because computers are our slaves and not the other way around, we want to exchange information with them in our own terms: using alphabets and characters. A character set does not define what those representations mean to the computer; its only purpose is to represent each character in a unique way. So it is really a very simple table (OK, not so simple in the case of Unicode). Still, because there are many characters that humans want to use and only two "characters" in the computer's alphabet, human characters need relatively much space to be described.

As a legacy from the early days of computing, when 1 byte was an expensive piece of space in terms of availability, performance and cost, and mainly because of compatibility, many things today are still stuck with the so-called "single-byte encodings".

Single-byte encodings are tables that use 8 bits for each character, to describe as many characters as possible in that space (that was not true initially, since ASCII used only 7 bits, but with the addition of the iso-8859-x character encodings the 8-bit space was fully utilized). The problem with single-byte encodings is that they can't describe enough characters: 8 bits give 256 values, of which barely about 200 are usable for printable characters. If we consider upper and lower case and all those special symbols, we see that 200 slots are just not enough for more than a couple of alphabets. So multiple tables were created, one for each major alphabet in the world. The good news is that all of them describe the characters of the Latin alphabet and the common symbols (all the ASCII characters, that is) the same way. So no matter what character set you used, you could always write e.g. English. The bad news is that all the other characters were incompatible among the various character sets. So people could never write Greek and French using the same set. They had to switch sets or use specific translation code in their software, which made things messy and buggy.

Today a few extra bytes per character are no big deal for our memory and storage capacities, so Unicode encodings have become popular. Unicode is a super-character-set that aims to include the characters of all the languages and special symbols in the world. UTF-8 is one encoding of Unicode that uses a variable number of bytes per character. ASCII characters are described using one byte (for compatibility and economy of space), and all the rest use two, three or four bytes. Modern Greek letters, for example, take two bytes each.
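To make the variable length concrete, here is a minimal sketch that counts the bytes of a few characters. It assumes the script itself is saved as UTF-8; the helper name utf8_bytes is made up for this example.

```shell
#!/bin/sh
# Count how many bytes a string occupies. wc -c counts bytes, not characters,
# so the result is the length of the UTF-8 encoding (assuming this file is UTF-8).
utf8_bytes() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# An ASCII letter, a Greek letter and the euro sign.
for ch in a α €; do
    printf '%s takes %s byte(s)\n' "$ch" "$(utf8_bytes "$ch")"
done
```

On a UTF-8 system this reports 1 byte for "a", 2 for "α" and 3 for "€".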

In Linux most programs, including the console, will encode/decode their I/O characters using the character encoding defined on the system LOCALE.

What would happen if one tried to interpret text encoded in one character encoding using another? It depends on whether the binary representation of the encoded character has a valid match in the encoding table used to decode it, on the font used, and on the software that handles the I/O. Results vary, from seeing nothing to seeing irrelevant characters, which might be printed as garbage with the current font or replaced with some predefined character by some pieces of software.
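A small sketch of this effect, assuming the iconv utility is installed and knows both encodings: the single byte 0xE1 is decoded once as iso-8859-7 and once as iso-8859-1. The helper name decode_byte_e1 is made up for this example.

```shell
#!/bin/sh
# The same byte (0xE1, octal 341) decoded with two different tables.
# $1 = name of the source encoding, as understood by iconv.
decode_byte_e1() {
    printf '\341' | iconv -f "$1" -t UTF-8
}

# In iso-8859-7 the byte is the Greek small letter alpha;
# in iso-8859-1 it is the Latin small letter a with acute.
printf 'as ISO-8859-7: %s\n' "$(decode_byte_e1 ISO-8859-7)"
printf 'as ISO-8859-1: %s\n' "$(decode_byte_e1 ISO-8859-1)"
```

Neither result is "wrong" from the computer's point of view; only the table chosen for decoding differs.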


For a more detailed and correct description, refer to WikiPedia's article:
http://en.wikipedia.org/wiki/Character_encoding


--------------------------------------
- LOCALE.
--------------------------------------

We usually refer to these as a bunch of variables that various programs check in order to decide how to display their characters, messages, and various standard notations. Their values usually depend on the location (or ethnicity) of the user, hence the name. Each LOCALE variable is in fact a group of internal variables that define even more specific things. We don't have to know what the internal variables do, just have an idea about the external ones. For example, LC_CTYPE (probably the most important one until now, though once everyone moves to Unicode it should become the least relevant) defines how each character should be treated and which encoding table applications should use to encode output characters and decode input characters. Another example is LC_TIME, which internally contains variables such as the national names for months and days.
All those variables are provided by GLIBC and can be changed on demand on a per-group basis (see the table below), simply by exporting the external group variables with a supported value. The values that GLIBC supports were defined when it was compiled.

To view all the supported LOCALE values, execute "locale -a". To view your current ones, execute "locale". For more information read "man locale". Here's a useful table extracted from gentoo.org documentation:
 
LC_ALL		Define all locale settings at once. This is the top level setting for locales which will override any other setting.
LC_COLLATE 	Define alphabetical ordering of strings. This affects e.g. output of sorted directory listing.
LC_CTYPE 	Define the character handling properties for the system. This determines which characters are seen as part of alphabet, numeric and so on. This also determines the character set used, if applicable.
LC_MESSAGES 	Programs' localizations for applications that use message based localization scheme (majority of Gnu programs, see next chapters for closer information which do, and how to get the programs, that don't, to work).
LC_MONETARY 	Defines currency units and formatting of currency type numeric values.
LC_NUMERIC 	Defines formatting of numeric values which aren't monetary. Affects things such as thousand separator and decimal separator.
LC_TIME 	Defines formatting of dates and times.
LC_PAPER 	Defines default paper size.
LANG 		Defines all locale settings at once. This setting can be overridden by individual LC_* settings above or even by LC_ALL.


Defining LOCALEs appropriately is required if you want support for non-ASCII characters and the conventional names of your area for things such as dates, months and currency representations. Typically you will be fine by setting LANG to your native language/encoding and then setting LC_MESSAGES to POSIX (or another English-based locale) to get messages in English. Leave LC_ALL unset in that case, because it overrides every other setting, including LC_MESSAGES.

Like all environment variables, LOCALEs must be defined for each new shell session. Global assignment can be automated by inserting the exports into /etc/profile, but most distributions, to keep things tidy, have a separate place for defining LOCALEs. In Frugalware this is /etc/profile.d/lang.sh (symlinked to /etc/sysconfig/language).
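The precedence described in the table (LC_ALL first, then the individual LC_* variable, then LANG) can be sketched as a small function. This only illustrates the lookup order and is not GLIBC's actual code; the name effective_locale is made up, and the final fallback to "C" is what I understand GLIBC to do when nothing is set.

```shell
#!/bin/sh
# Illustrative lookup order for one locale category.
# $1 = value of LC_ALL, $2 = value of the category variable, $3 = value of LANG
effective_locale() {
    if   [ -n "$1" ]; then printf '%s\n' "$1"   # LC_ALL overrides everything
    elif [ -n "$2" ]; then printf '%s\n' "$2"   # then the specific LC_* variable
    elif [ -n "$3" ]; then printf '%s\n' "$3"   # then LANG
    else printf 'C\n'                           # assumed default when all are empty
    fi
}

# Which locale would messages use in the current environment?
effective_locale "${LC_ALL:-}" "${LC_MESSAGES:-}" "${LANG:-}"
```

For example, effective_locale "" "en_US.utf8" "el_GR.utf8" prints en_US.utf8, while effective_locale "el_GR.utf8" "en_US.utf8" "el_GR.utf8" prints el_GR.utf8, because LC_ALL wins.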
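Before exporting a value in such a file, it can help to confirm that "locale -a" actually lists it. A minimal sketch follows; the function names are made up, and the normalization assumes the spelling that GLIBC's "locale -a" usually prints (lower-case codeset without hyphens, e.g. el_GR.utf8).

```shell
#!/bin/sh
# Turn "el_GR.UTF-8" into "el_GR.utf8": lower-case the codeset, drop hyphens.
normalize_locale() {
    case "$1" in
        *.*)
            nl_lang=${1%%.*}
            nl_codeset=$(printf '%s' "${1#*.}" | tr 'A-Z' 'a-z' | tr -d '-')
            printf '%s.%s\n' "$nl_lang" "$nl_codeset"
            ;;
        *)
            printf '%s\n' "$1"   # no codeset part, nothing to normalize
            ;;
    esac
}

# Succeed if the normalized name appears as a whole line in `locale -a`.
have_locale() {
    locale -a 2>/dev/null | grep -qxF "$(normalize_locale "$1")"
}

if have_locale "el_GR.UTF-8"; then
    echo "el_GR.UTF-8 is available"
else
    echo "el_GR.UTF-8 is not available on this system"
fi
```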


--------------------------------------
- Terminal
--------------------------------------

I won't go into detail here, both because I don't know much and because it's not necessary. All that you need to know is that the terminal is a layer underlying our console system, left over from an age when terminals were "dumb" hardware devices used to exchange I/O with mainframes. Today it is implemented virtually, through software that does certain boring jobs and in most cases won't bother you with anything. This time, however, we need to inform it about what kind of encoding our characters will have.
This is done simply by sending a special character sequence to it. To set UTF-8 support, we can "echo -ne '\033%G'" to send the sequence to our current terminal, or redirect (>) the output to any other terminal through /dev/tty*.
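According to the Linux console_codes(4) manual page, the sequence that switches a virtual console into UTF-8 mode is ESC % G, and ESC % @ switches it back to the default 8-bit mode. A minimal sketch that only produces the bytes (the function name utf8_mode_seq is made up) and leaves it to you to redirect them to a terminal:

```shell
#!/bin/sh
# Print the console escape sequence that turns UTF-8 mode on or off.
# Nothing is sent to a terminal unless the output is redirected to one.
utf8_mode_seq() {
    case "$1" in
        on)  printf '\033%%G' ;;   # ESC % G : select UTF-8
        off) printf '\033%%@' ;;   # ESC % @ : select the default 8-bit mode
        *)   return 1 ;;
    esac
}

# Show the raw bytes instead of changing the state of the current terminal.
utf8_mode_seq on  | od -An -tx1
utf8_mode_seq off | od -An -tx1
```

To actually switch a console you would run something like "utf8_mode_seq on > /dev/tty1" as a user allowed to write to that device.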


--------------------------------------
- Keyboard Mapping/Layout.
--------------------------------------

Just as the computer needs a table to describe internally the hundreds of characters we use, it also needs a table to match those characters to the key signals that we send by typing on our keyboard. And not only single keys, but also combinations of them, e.g. that Shift pressed together with the "A" key should change the capitalization of "A". How does the console know that "a" and "A" have a special relationship? It doesn't! As far as it's concerned, "a" and "A" are two different characters. It only maps unique keys/key sequences to unique characters (which it knows how to describe in a specific encoding). So we have another table, matching our keyboard's keys with their appropriate characters. To do that, the mapping table needs to be specific about two things: 1) the physical arrangement of our keyboard (whether the keys are arranged according to QWERTY, DVORAK, or another layout) and 2) how a character should be represented in binary (in other words, the encoding to be used). So keymaps are specific to both a character encoding and a physical keyboard layout.

* Notice that keyboard mappings relate exclusively to our ability to type the characters we want on our keyboard. If the computer doesn't use a local keyboard for input (e.g. it is a headless remote server accessed through telnet/ssh), you don't need to bother with keyboard mapping to have proper support for your language.

KBD, apart from an engine that allows extensive programming of key sequences, comes with a number of pre-written mappings, which map keys on various keyboards (e.g. following the UK and QWERTY layouts) to codes that match certain characters in a specific character encoding table (e.g. iso-8859-7).

In Frugalware those keymaps are included in /usr/share/kbd/keymaps and can be selected in /etc/sysconfig/keymap. The utility that puts them to work is "loadkeys" ("man 1 loadkeys"), and in Frugalware its execution is automated through the /etc/rc.d/rc.keymap control script.
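To see whether a keymap with a given name is shipped at all, you can search that directory. A minimal sketch; the function name find_keymap is made up, and the default directory is the Frugalware path mentioned above (other distributions may use a different one).

```shell
#!/bin/sh
# Print the path of the first keymap file called NAME.map or NAME.map.gz.
# $1 = keymap name, $2 = directory to search (defaults to the kbd keymap dir)
find_keymap() {
    fk_name=$1
    fk_dir=${2:-/usr/share/kbd/keymaps}
    find "$fk_dir" -type f \( -name "$fk_name.map" -o -name "$fk_name.map.gz" \) 2>/dev/null | head -n 1
}

# Prints nothing if the directory or the keymap does not exist.
find_keymap gr
```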

--------------------------------------
- Fonts.
--------------------------------------

They are the final link in the chain of displaying characters, following the charset. They are tables of a sort too, only they match codes with "glyphs": codes that stand for certain characters in a character encoding table are matched with glyphs (basically, images) that our software knows how to display on our screen and that we recognize as the specific characters when we read them. The font system, like the keymap system, is tied to the local hardware; this time not to the keyboard, but to the graphics adapter.

* For a remote headless server that you administer through ssh, you won't have to bother with fonts. For a remote headless server that you administer through vnc or something similar, however, you would have to bother with fonts, because the actual processing and rendering would no longer take place on your local computer.

KBD, again, provides the utility to define and display fonts on a screen. This is "setfont" ("man 8 setfont"). Also, kbd again comes with many fonts for various encodings. In Frugalware you can find these in /usr/share/kbd/consolefonts. The execution of setfont in Frugalware is automated through /etc/rc.d/rc.font script, and you can pass the fontname through /etc/sysconfig/font.


--------------------------------------
- Translation.
--------------------------------------

If we do not have a font that supports Unicode, or don't want to use one, we have two choices:
We can pass a console translation table to setfont each time we load the font (through the -u argument), or we can merge a Unicode table into the font through the psfaddtable utility. The second option didn't work for me; the first did. The -u argument needs to be added manually to the rc.font script in Frugalware, since for the moment it is not definable through a variable in /etc/sysconfig/font, like the font name is.
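One way to avoid hard-coding the table in rc.font would be an extra variable next to the font name. The sketch below only prints the command it would run. The variable "unimap" and the function setfont_cmd are hypothetical; Frugalware's /etc/sysconfig/font does not define them.

```shell
#!/bin/sh
# Build (and print) the setfont command line for a font and an optional table.
# $1 = font name, $2 = translation table to pass with -u (may be empty)
setfont_cmd() {
    sc_font=$1
    sc_table=${2:-}
    if [ -n "$sc_table" ]; then
        printf 'setfont %s -u %s\n' "$sc_font" "$sc_table"
    else
        printf 'setfont %s\n' "$sc_font"
    fi
}

# What a hypothetical extended /etc/sysconfig/font could contain:
font=iso07.16
unimap=/usr/share/kbd/consoletrans/8859-7_to_uni.trans   # hypothetical variable

setfont_cmd "$font" "$unimap"
```

Printing the command first makes it easy to check the result before letting a boot script execute it.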


What are the possible complications?


According to all the above, if we wanted support for e.g. Greek in a TTY console (notice, we are talking about a "real" terminal console, not a pseudo-terminal accessed through terminal emulator software such as Xterm, Konsole or Gnome-terminal), we would load a Greek keyboard mapping through loadkeys, set a Greek font through setfont, set our LOCALEs (particularly LC_CTYPE) to a supported Greek locale, like "el_GR", and inform the terminal to expect characters in this encoding by sending a special code. As long as all these use the same character encoding, we'd be fine. Indeed, there is a Greek LOCALE using the iso-8859-7 encoding (el_GR), a Greek keymap written for iso-8859-7 (gr.map.gz) and a font providing glyphs for the characters of iso-8859-7 (iso07.16.gz). So if we did the following, we'd be all right:

 
export LC_ALL="el_GR"
export LANG="el_GR"
loadkeys gr.map
setfont iso07.16
echo -ne '\033%@'


However, keymaps for encodings other than iso-8859 are generally hard to find. That includes the Unicode encodings which, as mentioned previously, have some advantages. For Greek in particular, I haven't seen any decent Unicode keymap. It is not impossible to create your own keymap for KBD, but it is a lot of boring and unrewarding work.

More importantly, there is a kernel-level limitation that doesn't allow "dead key" events in keyboard mappings to work in Unicode mode. Dead keys are keys that have no effect of their own when pressed, but instead modify the next key that is pressed. These keys are often used to type accented characters and other special marks in some languages. In Greek in particular, all multi-syllable words have at least one accented vowel! Now, with a Unicode keymap as is, you need to assign some other key sequence for typing accented characters; you can't use keys as "dead". This is very annoying (at least to me), even if I don't intend to type much Greek in a TTY console.

I don't know exactly why, but the method described below allowed me to overcome this limitation. If anyone can explain why, please do.
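To illustrate what a dead key does conceptually, here is a toy model in shell. It has nothing to do with how the kernel or kbd implement it; the function name compose_tonos is made up, and the table covers only the lower-case Greek vowels.

```shell
#!/bin/sh
# Toy model of a dead key: the accent key prints nothing by itself and
# instead changes the character produced by the NEXT key.
# $1 = the character typed after the dead key
compose_tonos() {
    case "$1" in
        α) printf 'ά' ;;
        ε) printf 'έ' ;;
        η) printf 'ή' ;;
        ι) printf 'ί' ;;
        ο) printf 'ό' ;;
        υ) printf 'ύ' ;;
        ω) printf 'ώ' ;;
        *) printf '%s' "$1" ;;   # no accented form known: pass the key through
    esac
}

# Dead key followed by "α" gives the accented vowel; followed by "β", just "β".
printf '%s %s\n' "$(compose_tonos α)" "$(compose_tonos β)"
```

The point is that the result depends on a pair of key presses, which is exactly the kind of lookup that has to work in the active keyboard mode.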

Practice

OK, enough with theory. Here's what you can do to enable support for Greek characters in Frugalware and still use UTF-8 as your system's locale.

1) Have the system's locales exported automatically to each new shell. You can do this by editing /etc/profile.d/lang.sh like this:

#!/bin/sh
 
# /etc/profile.d/lang.sh
 
# Set the system locale
# For a list of locales which are supported by this machine, type: locale -a
 
export CHARSET=utf-8
export LANG="el_GR.utf8"
export LC_CTYPE="el_GR.utf8"
export LC_NUMERIC="el_GR.utf8"
export LC_TIME="el_GR.utf8"
export LC_COLLATE="el_GR.utf8"
export LC_MONETARY="el_GR.utf8"
export LC_MESSAGES="en_US.utf8"
export LC_PAPER="el_GR.utf8"
export LC_NAME="el_GR.utf8"
export LC_ADDRESS="el_GR.utf8"
export LC_TELEPHONE="el_GR.utf8"
export LC_MEASUREMENT="el_GR.utf8"
export LC_IDENTIFICATION="el_GR.utf8"
 


2) Use iso07.16 as your font:
 
# /etc/sysconfig/font
 
# specify the console font
 
font=iso07.16


3) We also want to use a console translation table for our font, since it does not support Unicode, and we also want to inform our terminal to expect UTF-8 encoded characters. The rc.font script is a good place to do both. Also remember to comment out or remove the dumpkeys|loadkeys line. Edit /etc/rc.d/rc.font as follows:
 
#!/bin/bash
 
# (c) 2003-2006 Miklos Vajna <vmiklos@frugalware.org>
# (c) 2005      Marcus Habermehl <bmh1980de@yahoo.de>
# rc.font for Frugalware
# distributed under GPL License
 
source /lib/initscripts/functions
TEXTDOMAIN=font
TEXTDOMAINDIR=/lib/initscripts/messages
 
if [[ "$2" != "S" ]] ; then
        msg $"Loading console font"
fi
 
if [ -e /etc/sysconfig/font ] ; then
        source /etc/sysconfig/font
        if [ ! -z ${font} ] ; then
                setfont ${font} -u /usr/share/kbd/consoletrans/8859-7_to_uni.trans
                retval=$?
                if [[ "$2" != "S" ]] ; then
                        ok ${retval}
                fi
        fi
fi
if echo $LANG |grep -qi utf; then
       kbd_mode -u
       dumpkeys |loadkeys gr.map
       echo -ne "\033%G" > `tty`
fi
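The last block of the script runs only when $LANG names a UTF-8 locale. A slightly stricter variant of that check is sketched below; the function name uses_utf8 is made up. After logging in you can also run "kbd_mode" without arguments to see which mode the keyboard is in, and "locale charmap" to see the active character encoding.

```shell
#!/bin/sh
# Succeed if the given locale name mentions UTF-8, spelled "utf8" or "utf-8"
# in any letter case (e.g. el_GR.utf8, el_GR.UTF-8).
uses_utf8() {
    printf '%s\n' "$1" | grep -qi 'utf-\{0,1\}8'
}

if uses_utf8 "${LANG:-}"; then
    echo "LANG names a UTF-8 locale"
else
    echo "LANG does not name a UTF-8 locale"
fi
```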


4) We want to use a Greek keymap. Edit /etc/sysconfig/keymap and make sure rc.keymap is executed on runlevels 3 & 4:
# /etc/sysconfig/keymap
 
# specify the keyboard map, maps are in /usr/share/kbd/keymaps
 
keymap=gr.map


# ls -l /etc/rc.d/rc?.d/*.keymap
lrwxrwxrwx  1 root root 19 2006-09-29 05:44 /etc/rc.d/rc0.d/K50rc.keymap -> /etc/rc.d/rc.keymap*
lrwxrwxrwx  1 root root 19 2006-09-29 05:44 /etc/rc.d/rc1.d/K50rc.keymap -> /etc/rc.d/rc.keymap*
lrwxrwxrwx  1 root root 19 2006-09-29 05:44 /etc/rc.d/rc2.d/S80rc.keymap -> /etc/rc.d/rc.keymap*
lrwxrwxrwx  1 root root 19 2006-09-29 06:02 /etc/rc.d/rc3.d/S80rc.keymap -> /etc/rc.d/rc.keymap*
lrwxrwxrwx  1 root root 19 2006-09-29 05:44 /etc/rc.d/rc4.d/S80rc.keymap -> /etc/rc.d/rc.keymap*
lrwxrwxrwx  1 root root 19 2006-09-29 05:44 /etc/rc.d/rc5.d/S80rc.keymap -> /etc/rc.d/rc.keymap*
lrwxrwxrwx  1 root root 19 2006-09-29 05:44 /etc/rc.d/rc6.d/K50rc.keymap -> /etc/rc.d/rc.keymap*


5) For some reason the font needs to be loaded in each new login session. Let's do something close enough, by adding the following to /etc/profile:
 
if [[ `tty` = /dev/tty[0-9]* ]]; then
        source /etc/rc.d/rc.font
fi
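A test like the one above hinges on the glob pattern. A bracket expression matches exactly one character, so a class such as [0-12] means "0, 1 or 2" rather than a numeric range. A minimal sketch of a helper that accepts any virtual console name follows; the function name is_virtual_console is made up, and it assumes virtual consoles are named /dev/tty followed by digits.

```shell
#!/bin/sh
# Succeed for virtual console devices such as /dev/tty1 or /dev/tty12,
# fail for serial lines (/dev/ttyS0) and pseudo-terminals (/dev/pts/0).
is_virtual_console() {
    case "$1" in
        /dev/tty[0-9]*) return 0 ;;   # "tty" followed by at least one digit
        *)              return 1 ;;
    esac
}

for dev in /dev/tty1 /dev/tty12 /dev/ttyS0 /dev/pts/0; do
    if is_virtual_console "$dev"; then
        echo "$dev: virtual console"
    else
        echo "$dev: something else"
    fi
done
```

This plain "case" form also works in shells that lack the [[ ]] construct.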


After we reboot (actually, re-enter our runlevel and relogin), everything should work.

Still, I'm a little confused as to *why*. In particular, why an iso-8859-7 keyboard mapping works while the keyboard driver is in Unicode mode, and why the above resolves the dead-key limitation. If anyone can explain it, feel free!

regards,
nske