locale
or: how to get accented characters and umlauts
Update remark: Meanwhile, this document has probably become outdated.
It is about 8bit rather than Unicode.
It originates from a time when it could be exceptionally difficult to get locales running on some traditional Unix variants.
This document is not a source of fix-it-fast help, but tries to assist in solving even nasty problems. Thus, it contains both my experiences as well as numerous basic explanations and pointers to external documentation.
My advice: skim through the following once, and then read it thoroughly.
Mainly, this page deals with the printability of characters, i.e., the locale category "LC_CTYPE".
The examples emphasize the 8-bit ISO 8859 family (link to a PDF of the corresponding standard, see p.7), especially ISO 8859-1 or ISO 8859-15 aka Latin 9. The latter link points to a comparison by J. Korpela.
Nevertheless, this page also contains hints and pointers about locale handling in general.
However, you might be interested in Unicode as well, which is not covered by basic 8-bit locales. In that case, see below for some pointers to good documentation. For now, keep the command iconv(1) in mind, the "codeset converter", in case you ever run into UTF-8 (or another "incompatible" codeset) unprepared.
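As a small illustration of what iconv(1) does, the following sketch converts one UTF-8 encoded word to ISO 8859-1 and dumps the resulting bytes. The function name utf8_to_latin1 is made up for this example, and the codeset names "UTF-8" and "ISO-8859-1" are an assumption: they are accepted by glibc and GNU libiconv, but the spelling differs between systems (try "iconv -l" to list them).

```shell
# Convert UTF-8 text on stdin to ISO 8859-1 on stdout.
# Assumption: this iconv accepts the names "UTF-8" and "ISO-8859-1".
utf8_to_latin1() {
    iconv -f UTF-8 -t ISO-8859-1
}

# "gruen" with a real u-umlaut: in UTF-8 the umlaut is the two bytes
# 0xC3 0xBC (octal 303 274). After the conversion it should be the
# single Latin-1 byte 0xFC.
if command -v iconv >/dev/null 2>&1; then
    printf 'gr\303\274n\n' | utf8_to_latin1 | od -An -tx1
fi
```

The hex dump is used on purpose: it shows the bytes independently of what your terminal is able to display.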
By default, many Unix flavours only process ASCII without any problems (the standardized minimum 7bit-character-set). This is mainly for historical reasons, portability and usability.
If you want to use national characters on these systems (using 8-bit or multi-byte characters), then you must tell that to most programs. The general way to do that is by setting some environment variables to appropriate values, i.e., setting the "locale".
Unfortunately these values are not standardized and differ among the various Unix flavours.
There is not even a common, portable value for ISO8859-1 (aka "Latin 1"), which extends ASCII and is the most frequently used replacement in the western world. You have to find a valid value for your needs on your system.
The usual error messages, such as "Couldn't set locale correctly" (if you get one at all), are not nearly meaningful enough.
If your settings are wrong, all you know is that "apparently" it doesn't work.
Almost all values provided on a particular system work properly. But in special cases you're likely to run into problems again: The only command to find out about which values are supported by your system, "locale -a", doesn't tell which categories/environment variables are actually supported for a particular value. (On some systems there's a flag to find out about that - with rather complicated output on some systems - however some implementations are broken.) You even might have to look into system specific directories to find out about the supported categories.
And: several systems don't provide the command locale(1) at all.
Now, even a program paying attention only to the single category "printability of characters" might be affected by most of the above. And although setting the locale usually is rather simple and a feature in general (offering numerous categories and regulating subtle differences between the various languages), you might have become annoyed at first.
Fix it once and forever - that's Unix. An example for a standard program paying attention to the locale on most (commercial) systems is the traditional vi(1). Other examples are tin(1), mutt(1) and perl(1).
By the way, you don't need a national keyboard to type national characters. Using the "AltGr" mechanism is even rather comfortable. See general ways to use special characters in X11 if you're windowed.
On a few occasions the following deals with the mail client mutt, which was the original reason for writing a page about this subject. (For German language readers: since the newsgroup for mutt, comp.mail.mutt, is English speaking, this document is not written in my mother tongue.)
Certainly, the locale can also serve other character-sets. I will use ISO8859-1 just as a placeholder in the following.
The Danish-HOWTO is written in English and is useful for all Europeans on any Unix in general, as it provides tips for many Unix applications. Also, there's a short Linux-specific description about regenerating the locale (localedef).
(For German language readers: German-HOWTO.)
FreeBSD provides the login.conf(5) mechanism, which offers a very general - and recommended - way to do "localization". Open- and NetBSD provide login.conf as well, but there's no support for internationalization.
In the usenet newsgroup comp.mail.mutt, for instance, questions regularly show up about how to correctly display 8bit characters with the internal pager of this mailer, in particular because other affiliated programs, mainly editors or external pagers like vim(1), less(1) and several GNU tools, apparently "can do it" by default.
The reason is that mutt(1) exactly follows the settings of the corresponding environment variables. mutt(1) does not fall back to a common character set like the western European ISO 8859-1, providing accents and umlauts; instead it stays with the standard locale (C or POSIX), which usually means 'no features'. The names are derived from the former being defined by ANSI C, the latter by POSIX / SUS.
In fact this can even be considered a feature: keep in mind that when you really work with language-specific characters, numerous things might behave differently:
LC_CTYPE
)
LC_COLLATE
).
In some languages "z" is not the last character in the alphabet.
There are even locale values on some systems that let you distinguish between the different sort orders of telephone books and dictionaries.
LC_MESSAGES
)
LC_TIME, LC_MONETARY
)
LC_CTYPE
.
That's the Unix way - having set it properly once, all reasonable
applications suddenly know what you want.
Let's have a look at various programs:
For example vim(1), emacs(1), jed(1) and partly joe(1) "know" how to print 8bit characters out of the box.
Some of them don't care about the locale because they have their own configuration options. A few do so to support encodings which might not be supported by a system locale (e.g. Unicode), and some just don't care because they didn't know better, as in pre-223 versions of less(1), for example.
However, other programs do not ignore the locale:
vi (that is, the traditional vi) needs a proper setting for one category of the locale: LC_CTYPE.
LC_CTYPE
and LC_MESSAGES
, for example mutt(1)
and tin(1)
.
LC_CTYPE
and
might even silently refuse to accept/print eight bit characters
if the setting isn't appropriate.
And as another example, some shells also behave according to the locale:
In bash-2.x, the readline library initializes according to LC_CTYPE at startup. If you don't go this way, you have to fiddle with the readline settings themselves to be able to type 8bit characters (on Linux, your distributor might have already done it, so that you never needed to adjust readline settings yourself, but it was necessary all the same). More about bash/readline: see the post scriptum below.
ksh88 and ksh93, as well as tcsh, even track LC_CTYPE at run time.
So in general, numerous applications will consider the locale.
Nearly all Unix-systems know about locales, but the valid values are not the same on all systems. Some systems recognize only very few and special values. So find out and set the appropriate value/s for the locale. But before looking at system specific things, what does a certain value mean at all?
First look at your current settings:
    $ env | egrep 'LANG|LC_'
    LC_CTYPE=en_US
You could also try
    $ locale
    LANG=
    LC_CTYPE=en_US
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_ALL=
(All general categories will be reported, even those which are not explicitly set. In most implementations the double quotes signal an implicit setting.)
However, the latter way is not as robust as it may seem: if a category is set to an invalid value, setting the locale fails completely, yet locale(1) won't report an error.
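Because locale(1) stays silent about invalid values, it can help to compare the environment against a list of available values yourself. The following is only a minimal sketch: the function name check_locale_env is made up for this example, and the list is passed in as an argument, so you can feed it the output of "locale -a" or any other list. Be aware that some systems normalize the names they print (glibc, for instance, lists "en_US.utf8" while "en_US.UTF-8" is accepted as well), so a literal comparison can raise false alarms.

```shell
# check_locale_env LIST
# LIST: newline-separated locale names, e.g. "$(locale -a)".
# Prints every locale variable whose value is not contained in LIST
# and returns 1 if there was at least one such variable.
check_locale_env() {
    available=$1
    rc=0
    for var in LANG LC_ALL LC_CTYPE LC_COLLATE LC_MESSAGES \
               LC_MONETARY LC_NUMERIC LC_TIME; do
        eval "value=\${$var:-}"
        [ -n "$value" ] || continue      # unset or empty: nothing to check
        # -F: fixed string, -x: whole line must match, -q: quiet
        if ! printf '%s\n' "$available" | grep -Fqx -- "$value"; then
            printf '%s=%s: not in the list of available locales\n' "$var" "$value"
            rc=1
        fi
    done
    return $rc
}

# Usage, where locale(1) exists:
#   check_locale_env "$(locale -a)"
```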
(By the way: Note that some applications certainly might use their own very special variables, but that's not of concern here then.)
Now try "locale -a" to see the available values for the locale on your system.
This command doesn't exist on a few Unix flavours; in that case, see below.
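If you want a script to cope with both situations, a rough sketch could look like the following. The function names are made up for this example, and the fallback directories are merely the ones mentioned on this page; they are an assumption and certainly not complete for every system.

```shell
# list_locale_dirs DIR...
# Fallback: print the entries of the given locale directories.
# Directories which don't exist are skipped silently.
list_locale_dirs() {
    for dir in "$@"; do
        [ -d "$dir" ] && ls "$dir"
    done
    return 0
}

# list_locales
# Use "locale -a" if the command exists, otherwise look into some
# commonly used locale directories (assumption, adjust for your system).
list_locales() {
    if command -v locale >/dev/null 2>&1; then
        locale -a
    else
        list_locale_dirs /usr/lib/locale /usr/share/locale /usr/lib/nls/loc
    fi
}
```

Keep in mind that a directory entry only tells you that some category might be supported for that name, not which one.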
The main point is to set LC_CTYPE to an appropriate and legal name for your system (see below).
Then you should get your 8bit characters printed.
A correct value for western European 8-bit likely "sounds" like "iso88591" or "en_US", because on the numerous Unix systems I tried it was always one of the following: iso_8859_1, en_US, en_US.iso88591, en_US.ISO8859-1, en_US.ISO_8859-1.
No rule without exception. Vincent Lefèvre reports: on Maemo (Nokia's Linux distribution for phones), en_US actually implements UTF-8, not Latin-1.
And in the future, more implementations might move to implement UTF-8 by default.
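To pick the likely candidates out of a longer list, a simple filter on the name can serve as a first step. This is only a sketch with a made-up function name; the pattern covers just the spellings listed above. It deliberately does not match a bare "en_US", because the name alone doesn't reveal the codeset (see the Maemo exception).

```shell
# latin1_candidates
# Reads locale names on stdin and prints those whose name ends in one
# of the spellings of ISO 8859-1 (iso88591, ISO8859-1, ISO_8859-1,
# iso_8859_1). Names ending in 8859-15 are not matched.
latin1_candidates() {
    grep -iE 'iso_?8859[-_]?1$'
}

# Usage, where locale(1) exists:
#   locale -a | latin1_candidates
```

On systems whose locale(1) knows the keyword "charmap", something like "LC_CTYPE=<value> locale charmap" may tell you more about the codeset that is really behind a name.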
There are other variables for other meanings -- but two of them are special: They don't mean a real single category but influence all other categories in a general way:
LC_ALL, which overrides all others. Thus it should be set for debugging purposes only (e.g., enforcing a fallback to 7-bit ASCII with the value "C").
LANG, which has lower priority than all others. It doesn't override any value; that's its very purpose, in contrast to LC_ALL.
(So if you only want to adjust printability by setting LC_CTYPE, be sure that LC_ALL and LANG are unset.)
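The precedence just described can be written down in a few lines of shell. This is a sketch of the rule only, not of what a particular libc does internally; the function name effective_locale is made up, and falling back to "C" when nothing is set is an assumption (systems may use another default).

```shell
# effective_locale CATEGORY
# Prints the value that should be in effect for CATEGORY (e.g. LC_CTYPE):
# LC_ALL wins, then the category variable itself, then LANG.
effective_locale() {
    cat_name=$1
    eval "cat_value=\${$cat_name:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$cat_value" ]; then
        printf '%s\n' "$cat_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"        # assumption: "C" is the default
    fi
}

# Usage:
#   effective_locale LC_CTYPE
#   effective_locale LC_MESSAGES
```

Note that empty values are treated like unset variables here, which is why the tests use -n.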
However, the output of "locale -a" only means that there is some support for those values, no matter for which categories exactly. This means: if you have set LANG but still have problems, then only one category, e.g. LC_CTYPE or LC_MESSAGES, might support this value.
If you want to read messages and menus in your mother tongue, set LC_MESSAGES. (This already happens implicitly if you set LANG.)
Keep in mind that the application must come with the appropriate translations itself, properly installed, because the system certainly can't know them.
Look for manual pages like 'environ(5)/(7), locale(1)/(7)/(5), setlocale(3C)/(3), localedef(4), i18n_intro(5), l10n_intro(5),' etc., and find out about all the corresponding environment variables, the most important ones being LC_ALL, LC_CTYPE, LC_MESSAGES and LANG.
Pay attention to choose the proper section, because there might be several entries with the same name. This means for example "man 5 environ" (or "man -s5 environ" on Solaris). The numbers in parentheses above are suggestions for sections in which you might find them. It's time to do "man man" now, if you didn't know that by heart.
In general, a value for a locale category (the corresponding environment variable) is constructed like this:
    "xy[_XY][.codeset][@modifier]"
Unfortunately there is no standardization.
    xy: language abbreviation, ISO 639-1 (2 characters) [ISO 639-2 (3 characters) might be used for languages without a two-letter code]
    XY: country/territory, ISO 3166 [ISO 3166-1 alpha-3 three-letter codes might be possible]
    codeset: e.g. "iso88591", "ISO8859-1", "UTF-8", "greek8", "roman", etc.
    modifier: anything else refining it, for example "euro" for the currency symbol, or "phone" for a different sorting order (LC_COLLATE).
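The syntax above can be taken apart with plain shell parameter expansion. The following sketch uses a made-up function name and simply follows the pattern; since the values are not standardized, it will produce nonsense for names which don't follow it (for example the value "iso_8859_1").

```shell
# split_locale VALUE
# Splits a value of the form xy[_XY][.codeset][@modifier] into its parts.
split_locale() {
    value=$1
    territory= codeset= modifier=
    # Strip from the right: first the modifier, then the codeset,
    # and finally the territory. What remains is the language.
    case $value in
        *@*) modifier=${value#*@}; value=${value%%@*} ;;
    esac
    case $value in
        *.*) codeset=${value#*.}; value=${value%%.*} ;;
    esac
    case $value in
        *_*) territory=${value#*_}; value=${value%%_*} ;;
    esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$value" "$territory" "$codeset" "$modifier"
}

split_locale en_US.ISO8859-15@euro
# prints: language=en territory=US codeset=ISO8859-15 modifier=euro
```

The codeset is stripped before the territory on purpose, so that an underscore inside the codeset (as in "en_US.ISO_8859-1") is not mistaken for the separator.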
Some examples for such values:
"C" - the standard value, usually the default, the same as not setting the category at all. 7-bit ASCII charset, no goodies. Ironically enough, at least one vendor (HP) apparently felt the need to provide "C.iso88591". The name C is associated with ANSI C.
"en_US, en_US.iso88591" - ASCII and the western Europe specific characters. en_US (if available) always contains iso8859-1, even without the codeset suffix. However, HP-UX 10/11 provide only values with the codeset, for example en_US.iso88591. You see it's essential to find out about the valid values instead of only guessing.
"de_DE" is valid syntax, but this value doesn't exist on many Solaris versions (only "de" does).
"fr_CA.roman8" - might be appropriate for Canadians
"zh_TW.big5" - traditional Chinese in Taiwan with the BIG5 codeset (not an eight bit locale, but a good example).
"en_US.ISO8859-15@euro" - example from Solaris supporting the "euro sign" instead of the dollar sign as currency character (apart from that, @euro is the default for iso8859-15).
"stty cs8 -istrip". Many shells and TTYs require this.
telnet(1), confirm that you run in 8bit mode: Press <CTRL-]> and then "set ?" and "toggle ?". See the variables inbinary and outbinary. Fix them or start telnet with the right options. Adjust ~/.telnetrc. (Note that ssh(1) is 8bit-clean.)
dtterm(1) (not on Solaris, but on HP-UX, AIX), hpterm(1), aixterm(1). You might have to start them with correct settings. Yes, this can be sort of a crux...
xrdb(1), "xrdb -q", to see the general settings and appres(1) for application specific resources. An example: xterm ('appres XTerm xterm') knows the resource "XTerm*eightBitOutput", which correctly defaults to True. (Note that the resource "eightBitInput" has a completely different meaning and is not of concern here.)
perl(1) there's an elegant way to print the 8bit characters:
    perl -e 'for$i(160..255){printf"%c%c",$i,($i%16==15)?10:32}'
setlocale(3) in your program succeeds. Use the example below to get detailed information about this step.
    $ unset LANG LC_ALL; LC_CTYPE=<value>; export LC_CTYPE
    % unsetenv LANG LC_ALL; setenv LC_CTYPE <value>
Check also system-wide configuration files, which tend to set LANG or even LC_ALL. (This might be /etc/*profile and /etc/default/[i18n|lang], for example.)
LANG
and you haven't noticed this.
As a simple example: for setting LC_CTYPE there is an entry like one of these on almost all systems:
    /<path-to-locale-directory>/<locale-value>/LC_CTYPE/ctype (e.g. Solaris)
    /<path-to-locale-directory>/<locale-value>/LC_CTYPE (e.g. Linux glibc2)
    /<path-to-locale-directory>/<locale-value> (e.g. HP-UX)
On Linux, pay attention to /usr/share/locale vs. /usr/lib/locale. Both might exist due to an upgrade (with only one containing a locale).
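Checking for such an entry can be sketched like this. The function name is made up, and you have to supply the locale directory of your system yourself. Also note, as an additional caveat, that newer glibc installations often keep their locales in one common archive file (locale-archive), in which case a check like this finds nothing although the locale exists.

```shell
# has_ctype_entry LOCALEDIR VALUE
# Returns 0 if one of the three layouts described above exists
# for VALUE below LOCALEDIR, and 1 otherwise.
has_ctype_entry() {
    dir=$1 value=$2
    [ -f "$dir/$value/LC_CTYPE/ctype" ] ||   # e.g. Solaris
    [ -f "$dir/$value/LC_CTYPE" ] ||         # e.g. Linux glibc2
    [ -f "$dir/$value" ]                     # e.g. HP-UX
}

# Usage:
#   has_ctype_entry /usr/lib/locale en_US && echo "LC_CTYPE entry found"
```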
For mutt this is the configure switch "--enable-locales-fix", so you have to recompile mutt. Another example: tin provides "--disable-locale".
Also, some programs might not handle the wide-character support of glibc. Pre-mutt-1.3 in connection with such a glibc is an example. Recompiling with said option should help.
For mutt-1.3 (developer versions), if you still have problems, also use "--without-wc-funcs" (without wide character functions). You should have seen it already in INSTALL and "configure --help".
LC_MESSAGES, your value must include the country (see syntax above). Thus, it's not "en, de, ..." but "en_US, de_DE, ...", even if the former are reported as valid values by "locale -a".
Why? In the directories named by language abbreviations (i.e., "de", "fr", etc.), you'll usually only find the translations of messages (LC_MESSAGES) for various programs. But LC_MESSAGES stuff is accessed by a mechanism different from the other categories.
And: don't confuse a language abbreviation (fr, es, de) with a locale alias (like french, spanish, german) from the file /usr/[share|lib]/locale/locale.alias. Be very careful about using these aliases as well.
LC_CTYPE files in the system, but there's only one common entry for each value, like /usr/lib/nls/loc/locales.1/en_US.iso88591 (confirm that by looking into /usr/lib/nls/loc/src/en_US.iso1.src).
LC_CTYPE to the value "iso_8859_1".
Message-ID: <slrn8trvil.1qm.hschlen@humbert.ddns.org>, Heiko Schlenker mentions that on Debian GNU/Linux one might still have to post-install a package, like "user-de" for "de" support.
isprint(3) doesn't work as expected.
See localedef(1) for fixing or rebuilding a locale installation on the lower level (pointed out by Jürgen Dollinger). This command is, like locale(1), available on practically every Unix, except Free/Open/NetBSD, SunOS 4 and Irix5. You'll find an example in the Danish HOWTO.
Some Linux distributions come with their own way to do this (e.g., Debian).
LC_CTYPE (unless you have disabled 8bit support by recompiling libc with -DUSE7BIT).
LC_CTYPE is supported and you'll find the supported values in /usr/share/locale/, see mklocale(1). See also src/lib/libc/gen/ctype_.c.
Output on the console ttys might be limited to ASCII before OpenBSD 2.9:
    From: "Arvid Grøtting"
    Newsgroups: comp.unix.bsd.openbsd.misc
    Subject: Re: Problem with ASCII representations
    Date: Fri, 16 Mar 2001 09:43:26 GMT
    Message-ID: <l8u24udp6p.fsf@gorgon.netfonds.no>
Concerning console drivers, see pcvt(4) up to 2.8 and wscons(4) from 2.9 on.
If you use the on-board vi (in fact it's "nvi") on systems before 2.8, see the post scriptum below about nvi.
Apart from the above: programs certainly might install their own messages, using LC_MESSAGES then. However, this is done with a mechanism completely different from setlocale(3), so it's not affected by the above limitation.
You might try resorting to Linux emulation, if you ever need something very special.
NetBSD 1.5 and earlier have very little support for locales. Be aware of setlocale(3) being a stub for all categories but LC_CTYPE. setlocale(3) implies that there is support for LC_CTYPE (see its BUGS section), but AFAIK there is none. A look into the system directories (/usr/share/[locale|nls]) will confirm this.
See http://www.netbsd.org/Documentation/misc/index.html#locales for LC_CTYPE support. But the link therein was not accessible under special circumstances (firewall configurations) at the time of this writing. Thus, see also ftp://ftp2.fr.netbsd.org/pub/NetBSD/arch/i386/french-1.4/locale.tgz just in case.
Planning for multi-byte support has been started, but I haven't been following that, as I don't run NetBSD myself:
    > From: itojun@iijlab.net (itojun@iijlab.net)
    > Subject: multibyte LC_CTYPE locale support from Citrus XPG4DL repository
    > Newsgroups: comp.unix.bsd.netbsd.announce
    > Date: 2001-01-25 07:59:59 PST
    >
    > NetBSD-current now integrates multibyte LC_CTYPE locale support,
    > from the Citrus XPG4DL codebase.
    > [...]
    > http://citrus.bsdclub.org/index-en.html
NetBSD 1.6 comes with several locales installed.
From setlocale(3): The current implementation supports only the "C" and "POSIX" locales for all but the LC_COLLATE, LC_CTYPE, and LC_TIME categories.
However you can set the other variables anyway. The libc will only stat the locale directory itself, but not try to access category specific files then. (Yet, this dummy-stat() certainly fails for invalid values.) Specific applications might make use of those categories in their own way.
And, as mentioned at the top, don't forget about login.conf(5), e.g. using a ~/.login_conf with
    me:\
        :charset=iso-8859-1:\
        :lang=en_US.ISO8859-1:
First it tries to set the locale like other programs. Then it additionally inspects LC_CTYPE and LC_MESSAGES more thoroughly, indicating the printable characters according to isprint(3), and issuing an error message with perror(3) to show the language of system messages (which will be English in most cases, though). But it doesn't try to print a localized message of its own or of another utility (because you might have to install these messages in a system directory).
It will complain about all problems that occur. It will warn if a call to "setlocale" returns with a value different from the value it was (implicitly) called with: some locale implementations internally try additional, modified values, particularly if your value contains a charset or modifier suffix. (And if you set both LANG and another explicit category, then setlocale() will return a "composite value".) Thanks to Alain Bench for pointing this out to me!
Note: Depending on your font settings and your browser, you might not be able to see the latin1 characters contained in the following quote. (Also note that you usually shouldn't mix different locale values - certainly with the exception of "unsetting" some categories with the value "C".)
Example:
    $ uname -sr
    SunOS 5.9
    $ LC_CTYPE=iso_8859_1 LANG=nonsense LC_MESSAGES=POSIX ./checklocale
    [Latin1/9] If there's no literal copyrightsymbol at the end of this sentence,
    then your terminal/terminalemulator/font is not ISO8859-1/15 ready: ©
    - Current environment settings:
      LANG = "nonsense"
      LC_CTYPE = "iso_8859_1"
      LC_MESSAGES = "POSIX"
    - Implicitly setting all locale categories with LANG failed.
      You might want to unset/fix it now and/or set supported categories instead.
    - Setting LC_CTYPE to "iso_8859_1" succeeded.
      Testing LC_CTYPE with isprint():
      # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
      ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
      @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
      ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ #
      # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
      ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
      À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
      à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
    - Implicitly setting LC_NUMERIC by LANG failed.
    - Implicitly setting LC_TIME by LANG failed.
    - Implicitly setting LC_COLLATE by LANG failed.
    - Implicitly setting LC_MONETARY by LANG failed.
    - Setting LC_MESSAGES to "POSIX" succeeded.
    - Testing LC_MESSAGES with perror() for EAGAIN, that is, a libc message.
      Message catalogs are in /usr/share/locale, according to bindtextdomain().
      perror() tells: Resource temporarily unavailable
      strerror() tells: Resource temporarily unavailable
fixed
", btw.
LC_COLLATE, be very careful with sort(1)ing, tr(1)ing, and using character ranges (shell -c 'echo [A-Z]*').
Likely, you get unexpected but probably valid results, depending on shell, system and certainly locale:
    $ touch A B C a b c
    $ LC_COLLATE=C shell -c 'echo [A-Z]*'
    A B C
    $ LC_COLLATE=en_US shell -c 'echo [A-Z]*'
    A a B b C c
    $ ls * | LC_COLLATE=C sort
    A
    B
    C
    a
    b
    c
    $ ls * | LC_COLLATE=en_US sort
    A
    a
    B
    b
    C
    c
If you have set LANG, then you might want to add LC_COLLATE=C.
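In scripts which must not depend on the settings of whoever runs them, one way to get predictable sorting is to pin the locale for the single command. A minimal sketch with a made-up function name; it uses LC_ALL instead of LC_COLLATE on purpose, because LC_ALL overrides everything else the caller might have set, and it is set for this one command only.

```shell
# sort_bytewise
# Sort stdin in plain byte order, whatever LANG or LC_COLLATE say.
sort_bytewise() {
    LC_ALL=C sort
}

printf '%s\n' a B A b | sort_bytewise
# prints A, B, a, b (one per line): in ASCII all upper case letters
# come before the lower case ones
```

For patterns, a character class such as [[:upper:]] avoids ranges like [A-Z] altogether.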
(German language readers: [cert.uni-stuttgart.de] "Locale-Einstellungen mit überraschenden Auswirkungen" (locale settings with surprising effects). And a thread specifically about the actual LC_COLLATE problem in de.comp.os.unix.shell, starting with <9qd9ot0e1b4cpboupq7p78ch9o4ub6vcb1@4ax.com>.)
Using LC_CTYPE will usually do no harm here. But be careful about security issues. One might imagine security relevant characters being encoded in a way that a program doesn't recognize them, e.g. "../" using Unicode instead of ASCII.
LC_MESSAGES, might be vulnerable to buffer overflows. See also www.cert.org, "Vulnerability in Natural Language Service" about this.
set convert-meta off, set input-meta on, set output-meta on, set meta-flag on (synonym for input-meta).
bash(1) about INPUTRC then. This is required if you cannot get working locale support for any reason.
If setlocale(3) is not available at all, readline accepts the special values "iso8859[1-10]" and "koi8r", see bash-2.x/lib/readline/nls.c, "legal_lang_values[]".
LC_CTYPE which is not supported on OpenBSD, though.
Additionally nvi knows an option to force printing of certain characters anyway. However, most versions of nvi suffer from a bug and you need the following in ~/.nexrc or the like:
    set print="<printable characters>"
    set print=
where <printable characters> is just all the literal characters you want, e.g. äåâàáöôø...
It's fixed in OpenBSD 2.8.
LC_CTYPE works as expected: dumpcs(1) prints all printable characters.
dtlogin(1), see the corresponding manpage, /usr/dt/config/ and /etc/dt/config/.
locale -kc LC_CTYPE", and tell you which characters are printable.
    From: Christian Weisgerber
    Newsgroups: de.comp.os.unix.bsd
    Subject: Re: Umlaute im vi (pcvt) ?
    Date: Mon, 26 Mar 2001 16:01:47 +0000 (UTC)
    Message-ID: <99np5b$1bup$1@kemoauc.mips.inka.de> (original link)
The 2nd edition of ksh93-l (l+) (still called 2001-07-04) fixes this problem.
Ultimately, Unicode is the way to go for a really useful encoding (however, the problem with locales would still remain to some degree). Don't miss the following:
With credits to: Jürgen Dollinger, Christian Weisgerber, Heiko Schlenker, Rudolf Hommer, Chris Green, Sven Guckes, Olav Kvittem, Thomas Schultz, Vincent Lefèvre and especially to Mark Glassberg and Alain Bench.
Your own experiences and other feedback are most welcome
comments to <mascheck@in-ulm.de>