Network Working Group H. Alvestrand
Request for Comments: 1502 SINTEF DELAB
Harald.Alvestrand@delab.sintef.no
August 1993
X.400 Use of Extended Character Sets
Status of this Memo
This RFC specifies an IAB standards track protocol for the Internet
community, and requests discussion and suggestions for improvements.
Please refer to the current edition of the "IAB Official Protocol
Standards" for the standardization state and status of this protocol.
Distribution of this memo is unlimited.
1. Introduction
Since 1988, X.400 has had the capacity for carrying a large number of
different character sets in a message by using the body part
"GeneralText" defined by ISO/IEC 10021-7.
Since 1992, the Internet also has the means of passing around
messages containing multiple character sets, by using the mechanism
defined in RFC-MIME.
This RFC defines a suggested method of using "GeneralText" in order
to harmonize as much as possible the usage of this body part.
2. General principles
2.1. Goals
The target of this memo is to define a way of using existing
standards to achieve:
(1) in the short term, a standard for sending E-mail in the
European languages (Latin letters with European accents,
Greek and Cyrillic)
(2) in the medium term, extending this to cover the Hebrew and
Arabic character sets
(3) in the long term, opening up true international E-mail by
allowing the full character set specified in ISO-10646 to be
used.
Alvestrand [Page 1]
RFC 1502 X.400 Use of Extended Character Sets August 1993
The author believes that this document gives a specification that can
easily accommodate the use of any character set in the ISO registry,
and, by giving guidance rules for choosing character sets, will help
interworking.
2.2. Families of character sets
2.2.1. ISO 6937/T.61
ISO 6937 is a code technique used and recommended in T.51 and T.101
(Teletex and Videotex service) and in X.500, providing a repertoire
of 333 characters from the Latin script by use of non-spacing
diacritical marks. It corresponds closely to CCITT recommendation
T.61.
The problem with that technique is that the character stream comes in
two modes, i.e., some characters are coded with one byte and some
with two (composite characters). This makes information processing
systems such as an E-mail UA or GW more complex.
It is also not extensible to other languages like Korean or Chinese,
or even Greek, without invoking the character set switching
techniques of ISO 2022.
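The two-mode stream described above can be illustrated with a small sketch. It is not a full ISO 6937 decoder: the table of non-spacing diacritics is an assumed four-entry subset, the function name is hypothetical, and only US-ASCII base letters are handled. It shows why a reader must look at two bytes at a time whenever a diacritic byte appears.

```python
import unicodedata

# Assumed subset of the ISO 6937 non-spacing diacritics
# (byte value -> Unicode combining mark). Not the full repertoire.
NON_SPACING = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex accent
    0xC8: "\u0308",  # diaeresis
}


def decode_iso6937_subset(data: bytes) -> str:
    """Decode ASCII text mixed with two-byte 'diacritic + base' pairs."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in NON_SPACING:
            # The diacritic comes first; the base letter follows it.
            if i + 1 >= len(data) or data[i + 1] >= 0x80:
                raise ValueError("diacritic without an ASCII base letter")
            base = chr(data[i + 1])
            # Unicode puts the combining mark after the base; NFC composes them.
            out.append(unicodedata.normalize("NFC", base + NON_SPACING[b]))
            i += 2
        elif b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            out.append("\ufffd")  # outside this sketch's repertoire
            i += 1
    return "".join(out)
```

A byte such as 0xC2 therefore never stands alone: the character boundary can only be found by scanning, which is the extra complexity the text refers to.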
2.2.2. ISO 8859
ISO 8859 defines a set of character sets, each suitable for use in
some group of languages. Each character in ISO 8859 is coded in a
single byte.
There are currently 11 parts of ISO 8859, plus a "supplementary" set,
registered as ISO IR 154. Most languages using single-byte characters
can be written in one or another of the ISO 8859 sets. There are
sets covering Greek, Hebrew and Arabic, but there is still
controversy over the problem of the rendering direction for Hebrew
and Arabic.
All the ISO 8859 sets include US-ASCII as a subset. All use 8 bits.
ISO 8859 is regarded by many as a solution; for instance, the X
windows system now comes with ISO-8859-1 as the "standard" character
set, with the possibility of specifying others. But since the same
applications often do not support character set switching within
text, it is problematic to use these in a truly multilingual
environment. (Also, most fonts claiming to be "ISO-8859-1" in X11R5
are actually 7-bit fonts. The implied lie is very unfortunate.)
It turns out to work fine, however, if the second language is
English, since this can be written in all ISO 8859 sets.
The parts 3 and 4 have not seen wide acceptance, and it is expected
that they will be discarded. They should therefore not be used.
Note that an ISO 8859 set is actually 2 sets in the ISO sense: US-
ASCII in the G0 set and another character set in the G1 set. The
overloading of the word "character set" is unfortunate, but
traditional.
2.2.3. ISO 10646
At the moment of writing, ISO 10646 has just been accepted as an
International Standard. It is basically a 32-bit character set, with
all of the currently used characters being numbered by the first 16
bits, leaving some room for expansion.
It is not possible to use ISO 10646 as a normal character set,
because it does not conform to the rules for usage of byte values set
down in ISO 2022 and other places; it uses the "control space" for
(parts of) graphic character codes.
There are a number of ways to encode ISO 10646 characters "on the
wire". There are methods within the ISO 2022 standard to switch to
these, either as "other coding system without return" or as "other
coding system with return" (that is, you can go back from it to the
one you came from using an ISO 2022 escape sequence).
The following registrations have been made:
ISO 10646 UCS-2 Level 1 has been registered with ESC 2/5 2/15 4/0,
ISO 10646 UCS-4 Level 1 has been registered with ESC 2/5 2/15 4/1,
The following are applied for:
Reg#  Escape sequence   Standard/Sponsor  Description
174   ESC 2/5 2/15 F    ISO/IEC 10646     UCS-2, Level 2
175   ESC 2/5 2/15 F    ISO/IEC 10646     UCS-4, Level 2
176   ESC 2/5 2/15 F    ISO/IEC 10646     UCS-2, Level 3
177   ESC 2/5 2/15 F    ISO/IEC 10646     UCS-4, Level 3
178   ESC 2/5 F         ISO/IEC 10646     UTF-1
<< NOTE: The registration numbers for UCS-2 level 1 and UCS-4
level 1 are not known. Neither are the assigned final characters
for the other sets. Information requested!>>
This character set will become very important in the future, but at
the moment, few systems are able to support this directly.
The GeneralText body part can be used for carrying any of these
character sets.
2.3. Body parts that can be used in X.400
At the moment, no established way of transferring a full set of
characters in X.400-based E-mail exists. In the future, it is likely
that a new body part, based on ISO 10646, will be available, or
GeneralText may be able to use ISO 10646, but this matter has not yet
been clarified.
In the short term, the deployed and available body parts are:
(1) IA5Text
(2) For X.400/84: ISO6937Text and Teletex
(3) For X.400/88: GeneralText
IA5Text is the method of choice for E-mail that contains only
characters from IA5 (equivalent to US-ASCII).
The ISO6937Text body part is defined in the ISO DIS documents
corresponding to X.400(84) [10]; these never became a standard, so
they are now quite difficult to find. It is in principle limited to
using text that can be presented in ISO 6937, but since ISO 6937
refers to the ISO 2022 method of changing character sets, it is
theoretically possible to use any ISO registered character set, but
there is no facility for announcing the character sets used. This
makes interworking with equipment that does not support the same
character sets complex.
It is still, however, the only body part suitable for carrying
non-paginated text in non-basic character sets in X.400(84).
Teletex, which is identical in all versions of the X.400 standard,
has the same problem of implicit ISO6937, but has the added problem
that it also specifies a page format, with, for instance, a left
margin of 5 character positions. This is often not desirable.
The details of Teletex are specified in recommendation T.51 and its
relatives.
GeneralText is defined in ISO 10021-8, the part of [9] that
corresponds to CCITT recommendation [11]. It is an Extended body
part, so no modification to CCITT implementations is needed to carry
it.
GeneralText is suitable for interchange, since it has got proper
announcement facilities. It can use any number of character sets, and
announces them both in the Encoded Information Types of the X.400
envelope and the parameters of the body part.
We recommend this body part for carrying unformatted text in
X.400/88.
3. GUIDELINES FOR THE GENERATION OF GENERALTEXT
3.1. Formal definition of GeneralText
A GeneralText message is a byte stream that contains characters and
character switching sequences according to [12].
The X.400 ASN.1 definition of the GeneralText body part is:
general-text-body-part EXTENDED-BODY-PART-TYPE
   PARAMETERS GeneralTextParameters IDENTIFIED BY id-ep-general-text
   DATA GeneralTextData
   ::= id-et-general-text
GeneralTextParameters ::= SET OF CharacterSetRegistration
CharacterSetRegistration ::= INTEGER (1..32767)
GeneralTextData ::= GeneralString
The definition is from ISO/IEC 10021-7 [9], Annex I, with
modifications made in the MHS Implementors' Guide, version 8, chapter
3.6.3, bullet F130. It does not appear in the CCITT version of the
standards.
3.2. Brief description of ISO 2022 character set switching
There are 4 graphic character sets active at any time in a
GeneralText message, called G0, G1, G2 and G3. In addition, there are
2 control character sets, called C0 and C1.
At any moment, one of the sets G0-G3 is active in code positions 2/1
to 7/14, and another is active in code positions 10/0 to 15/15. The
setting is achieved by so-called "locking shift" sequences.
(Formally, code positions 2/0 and 7/15 are reserved for "space" and
"DEL" respectively, and only 94-character character sets can be used
in G0. In practice, this restriction is sometimes ignored.)
Single characters from the non-active sets may be invoked by the use
of "single shift" sequences.
The control character sets always occupy the code positions 0/0 to
1/15 (C0) and 8/0 to 9/15 (C1).
The character sets currently active as G0-G3 and C0-C1 may be changed
using "character set designating sequences".
At the beginning of a GeneralText message, one must always assume
that set 2 (IA5) is active as G0, shifted into the lower half, that
set 1 (standard) is active as C0, and that no G1-G3 or C1 set is
invoked. This is specified in the definition of "GeneralString" in
[5], the definition of ASN.1 encoding (section 23.5.2).
If this is not a suitable initial state, a message must always start
with the necessary announcers and escape sequences to designate and
invoke the character sets that are actually used. The character sets
in use may be changed later in the message by use of escape
sequences.
The parameters of a GeneralText message always list all the character
sets used, by quoting their ISO reference numbers.
It is impossible to use a character set not registered with ISO in a
GeneralText message.
It is also impossible to decide on the true meaning of a byte in a
GeneralText message without scanning the whole message for shift and
escape sequences.
3.3. How to use the character sets
RECOMMENDATION:
When the text to be rendered is representable in one of the character
sets of ISO-8859, the G0 set should be set to ISO 646 International
Reference Version (1991), also called US-ASCII, ISO-IR-6.
The older character set ISO-IR-2, ISO 646 IRV(1983), should NOT be
used. This means that the escape sequence ESC 2/8 4/2 (designating
US-ASCII as G0) should always occur at the beginning of the message.
The G1 set should be set to the character set identified by the
relevant ISO-8859 part. G2 and G3 are not used.
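The column/row notation used for code positions (for example ESC 2/8 4/2) maps directly to byte values: position c/r is the byte 16*c + r. A minimal sketch of that conversion follows; the function and constant names are illustrative only and do not come from any standard.

```python
def code_position(column: int, row: int) -> int:
    """Convert ISO 2022 'column/row' notation to a byte value (16*column + row)."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must both be in the range 0..15")
    return column * 16 + row


def sequence(*positions: tuple) -> bytes:
    """Build a byte string from a series of (column, row) pairs."""
    return bytes(code_position(c, r) for c, r in positions)


ESC = (1, 11)  # the ESC control character sits at position 1/11 (0x1B)

# ESC 2/8 4/2: designate US-ASCII (ISO-IR-6) as the G0 set.
DESIGNATE_US_ASCII_AS_G0 = sequence(ESC, (2, 8), (4, 2))
```

With this mapping, ESC 2/8 4/2 is the three bytes 1B 28 42 in hexadecimal, which is the form used in the tables of this memo ("Esc 28 42").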
This corresponds to the first level of ISO 4873 usage.
For the currently defined parts of ISO 8859, the character set
designations for the G1 set are (relative to ISO 8859:1987):
Part  ISO IR name  Escape sequence  Remarks
                   for G1 use
1     ISO-IR-100   Esc 2D 41        West Europe (Latin-1)
2     ISO-IR-101   Esc 2D 42        East European (Latin-2)
3     ISO-IR-109   Esc 2D 43        (Latin-3)
4     ISO-IR-110   Esc 2D 44        (Latin-4)
5     ISO-IR-144   Esc 2D 4C        Cyrillic
6     ISO-IR-127   Esc 2D 47        Arabic
7     ISO-IR-126   Esc 2D 46        Greek
8     ISO-IR-138   Esc 2D 48        Hebrew
9     ISO-IR-148   Esc 2D 4D        Baltic, Turkish (Latin-5)
10    ISO-IR-157   Not listed       Sami (Latin-6)
The escape sequence for 8859-10 (Latin-6) is not listed in RFC 1345.
NOTE: The use of ISO 8859-3 and ISO 8859-4 is NOT recommended if
other possibilities exist.
NOTE: There is a debate about the Arabic and Hebrew character sets.
These languages are normally read right to left, but encodings have
been done in both "visual" (left to right) and "phonetic" (right to
left) ordering; there is significant disagreement about what the
"right" way to do it is, and the character sets mentioned do not
specify it. So, one should be careful not to use these character sets
until a standard is agreed upon, or the result will probably be
unreadable (siht ekil).
(Note that there is some confusion as to what parts are actually
standardized; the Norwegian standards institute reports that only
part 1, 2, 3, 4, 6, 7 and 8 are currently standards. Other reports
claim that both 8859-10 and 8859-11 are standards, and I definitely
think that 8859-9 is.)
NOTE: ISO has not ruled out the possibility of changing the ISO 8859
standard. This would involve changing the registry information in
this table, so this should be assumed valid for ISO 8859 versions
that are current in 1993.
The G1 set should be permanently shifted into the upper half of the
code page.
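The recommendation for text that fits one ISO 8859 part can be sketched as follows. This is an illustration under stated assumptions, not a conforming X.400 encoder: the function name is hypothetical, the Python codec names are assumed to match the registered sets closely enough for the example (only parts 1, 2, 5 and 9 are included for that reason), and LS1R is assumed to be encoded as ESC 7/14 as in ISO 2022. It returns the registration numbers for the body part parameters together with the GeneralString data.

```python
ESC = b"\x1b"
DESIGNATE_G0_US_ASCII = ESC + b"\x28\x42"  # ESC 2/8 4/2, ISO-IR-6 as G0
LS1R = ESC + b"\x7e"                       # ESC 7/14: lock G1 into the upper half (assumed)

# ISO 8859 part -> (final byte of the G1 designation, ISO-IR number, Python codec).
# The codec names are an assumption of this sketch, not part of the memo.
PARTS = {
    1: (0x41, 100, "iso8859_1"),
    2: (0x42, 101, "iso8859_2"),
    5: (0x4C, 144, "iso8859_5"),
    9: (0x4D, 148, "iso8859_9"),
}


def make_general_text(text: str, part: int):
    """Return (parameters, data) for text representable in one ISO 8859 part."""
    final, registration, codec = PARTS[part]
    body = text.encode(codec)  # raises UnicodeEncodeError if not representable
    designate_g1 = ESC + bytes([0x2D, final])  # Esc 2D F from the table above
    data = DESIGNATE_G0_US_ASCII + designate_g1 + LS1R + body
    parameters = {6, registration}  # every character set used must be listed
    return parameters, data
```

For example, `make_general_text("blåbær", 1)` yields the parameter set {6, 100} and a byte string that begins with the three announcing sequences, followed by the text as single bytes.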
When the text is not representable in one of the ISO-8859 character
sets, the following rules may be applied:
(1) If any Latin characters are used, keep IA5 as the G0 set.
(2) If a mainstream character set is used (Greek, Cyrillic,
Hebrew, Arabic), designate this as the G1 character set,
and permanently shift this into the upper half of the code
page (LS1R).
EXCEPTION: The Japanese community has a long tradition of
switching between the Japanese 16-bit character set
ISO-IR-87 and US-ASCII as the G0 set. See [7]
for details. If ISO-IR-87 is used, that technique should be
used instead of the one recommended here.
(3) If occasional extensions to a character set that is
basically Latin occur (like accents, national variants
and so on), and these are available in a single character
set, designate the relevant set as G2 and use single
shift (SS2) to invoke characters from this character set.
The ISO 8859 supplementary set, ISO-IR-154, is recommended
for this purpose.
This corresponds to the ISO 4873 "second level" application.
(4) If two non-Latin character sets are used, the second should
be designated as G3, and shifted into the upper half of the
code page by the use of Locking Shift 3 Right (LS3R).
This corresponds to the ISO 4873 "third level" application.
(5) If avoidable, use of character sets with floating accents,
like ISO 6937, should be avoided.
(6) The shifts changing the lower half of the code table (SI/SO,
LS2 and LS3) should NOT be used.
RATIONALE: Keeping the G0 set reserved for US-ASCII will ensure that
text in US-ASCII always has the same bit representation.
The use of the upper code page for other scripts ensures that both
text in these languages and text of this type mixed with English can
be represented without the use of shift sequences.
If the language and/or content of a text is completely unknown,
chapter 5 gives an algorithm that may be used to decide upon the
character sets. This might, for instance, be suitable for use at
automatic mail gateways.
NOTE: At the time of this writing, few applications that use ISO 4873
level 2 and level 3 encoding exist. It has been estimated that
implementing them in an application that already uses a rich
repertoire of characters is a matter of programmer-days, not
programmer-months, but this has not been proven.
4. GUIDELINES FOR THE RENDERING OF GENERALTEXT
As a basic rule, one should NOT assume that any of the rules above
are followed.
A user agent capable of rendering GeneralText should:
(1) ALWAYS be able to identify and render characters in IA5, no
matter how they are designated and invoked.
(2) ALWAYS be able to identify and render characters in the
"native" character sets, no matter how they are designated
and invoked.
(3) ALWAYS indicate the presence of characters that cannot be
adequately represented on the current output device.
(4) NEVER render a character in an unknown or unrepresentable
character set by displaying the character in the same bit
position in the native character set.
(5) PREFERABLY be able to identify and render characters that are
the same as characters in the "native" character sets, even
though they are designated and invoked as part of other
character sets. This applies in particular to the
"invariant" part of ISO 8859, parts 1 through 6.
(6) PREFERABLY be able to combine the floating accents of ISO
6937 with their base characters for suitable rendering using
the capabilities of the current output device.
(7) PREFERABLY be able to display text both in a mode using
fallbacks for nonrenderable characters and in a mode
designating nonrenderable characters as such.
(8) PREFERABLY be able to save the content of a GeneralText
message to a file or other suitable media, saving all
character set information, for later processing by other
means. It is not illegal to render the character set
information into a different format; however, it should be
noted that it is easy to lose vital information if the format
chosen for representing character sets does not offer the
possibility of referencing all character sets in the ISO
registry of character sets.
These requirements also apply to gateways that transform the message
into some other format, for example a gateway that transforms a
message into MIME using [7] for the purpose.
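Rules (3) and (4) above can be illustrated with a deliberately small rendering sketch. It assumes a "native" repertoire of US-ASCII in G0 and ISO 8859-1 or ISO 8859-2 in G1, assumes G1 always occupies the upper half of the code table, and refuses escape sequences it does not understand. All names are hypothetical. Bytes from an unknown set are shown as U+FFFD rather than as whatever shares their bit position in the native set.

```python
ESC = 0x1B
UNKNOWN = "\ufffd"  # marks a character that cannot be rendered

# Final byte -> Python codec; a tiny, assumed native repertoire.
G0_CODECS = {0x42: "ascii"}                         # designated by ESC 2/8 F
G1_CODECS = {0x41: "iso8859_1", 0x42: "iso8859_2"}  # designated by ESC 2/13 F


def render_general_text(data: bytes) -> str:
    g0, g1 = "ascii", None  # initial state: IA5 as G0, no G1 designated
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            # Escape sequence: ESC, intermediate bytes 2/0..2/15, one final byte.
            j = i + 1
            while j < len(data) and 0x20 <= data[j] <= 0x2F:
                j += 1
            if j >= len(data):
                raise ValueError("truncated escape sequence")
            intermediates, final = data[i + 1:j], data[j]
            if intermediates == b"\x28":
                g0 = G0_CODECS.get(final)  # None when the set is unknown
            elif intermediates == b"\x2d":
                g1 = G1_CODECS.get(final)
            elif not (intermediates == b"" and final == 0x7E):  # LS1R is a no-op here
                raise NotImplementedError("unsupported escape sequence")
            i = j + 1
            continue
        if b < 0x20 or b == 0x7F:
            out.append(chr(b))  # C0 controls and DEL pass through
        elif b < 0x80:
            out.append(chr(b) if g0 else UNKNOWN)
        elif b >= 0xA0 and g1:
            out.append(bytes([b]).decode(g1))
        else:
            out.append(UNKNOWN)  # never reuse the native set's bit position
        i += 1
    return "".join(out)
```

A message that designates the Cyrillic set as G1 is thus rendered with visible placeholders for the Cyrillic letters, while the US-ASCII portion remains readable.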
5. RECOMMENDATION FOR SELECTION OF CHARACTER SETS
5.1. Algorithm for selection of character sets
When one has text in which characters from several character sets
occur, and wants to process this into a GeneralText document, it is
often hard to guess right at the character sets to select.
The following paragraphs give an algorithm that can be started at the
beginning of a message, and at the end of it, return a set of
character sets that can be used as G0..G3 character sets, OR an
indication that the task is impossible.
VARIABLES:
UsedSets
The set of character sets that MUST be used for this message
UsableSets
The set of character sets that MAY be used for this message.
Each set also contains a counter for each character position.
PossibleSets
The set of all the character sets known to be usable in the
destination format.
ALGORITHM:
1) Add IA5 (ISO-IR-6) to the UsedSets (as G0).
2) Get the next character of the text. If the text is
completely analyzed, go to FINISHED
3) If it is in the UsedSets, go to 2).
4) Find the set of character sets from PossibleSets in which the
character occurs. If it does not occur in any, report
failure.
5) If it is in a single character set in PossibleSets only, add
this set to UsedSets, and go to 2).
6) If it is in more than one character set, add these to
UsableSets (if not already present), and increment the
counter for that character in all the sets. Go to 2).
FINISHED)
1) (FINAL SELECTION) Remove any character set in UsedSets from
UsableSets.
Zero the counters for any character in UsableSets that also
occurs in UsedSets.
WHILE (any counter in UsableSets is nonzero)
Select one character set and move it from UsableSets to
UsedSets.
Zero the counters for all characters in this set in the other
UsableSets.
END WHILE
This step can be "tuned" any way you want, for instance by
choosing the character sets most likely to be understood at
the destination first, choosing the character sets covering
the most characters first, avoiding multi-byte character sets
as long as possible, or any other scheme suitable for the
application.
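The algorithm can be sketched in a few lines. The sketch makes several assumptions that are not in the text: character sets are modelled as plain Python sets of characters keyed by a name, the per-character counters are kept in the UsableSets variable from the variable list, the final selection is tuned to pick the set covering the most still-uncovered characters, and needing more than four sets (G0..G3) is reported as failure. The function name is hypothetical.

```python
from collections import Counter


def select_character_sets(text, possible_sets, g0="ISO-IR-6"):
    """possible_sets maps a set name to the set of characters it contains."""
    used = [g0]   # UsedSets: sets that MUST be used
    usable = {}   # UsableSets: name -> counter per ambiguous character
    for ch in text:
        if any(ch in possible_sets[name] for name in used):
            continue
        candidates = [n for n, chars in possible_sets.items() if ch in chars]
        if not candidates:
            raise LookupError(f"no known character set contains {ch!r}")
        if len(candidates) == 1:
            used.append(candidates[0])
        else:
            for name in candidates:
                usable.setdefault(name, Counter())[ch] += 1

    def zero_covered(name):
        # Drop counters for characters that the given set already covers.
        for counter in usable.values():
            for ch in list(counter):
                if ch in possible_sets[name]:
                    del counter[ch]

    # FINISHED: final selection among the ambiguous candidates.
    for name in used:
        usable.pop(name, None)
        zero_covered(name)
    while any(usable.values()):
        best = max(usable, key=lambda n: sum(usable[n].values()))
        used.append(best)
        del usable[best]
        zero_covered(best)
    if len(used) > 4:
        raise LookupError("more than four sets (G0..G3) would be needed")
    return used
```

Other tunings mentioned in the text, such as preferring sets likely to be understood at the destination, would only change the `key` used to pick `best`.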
5.2. WHAT TO DO ON FAILURE
Failure will occur in this schema if a character is found that is not
in the PossibleSets. It may then be handled in one of the following
ways:
(1) Replace the character with the SUB control character
(2) Replace the character with Keld Simonsen Mnemonics [8].
This is a reversible transformation as long as the
recipient is aware that it has been used, but requires
passing out-of-band information to indicate this.
(3) Replace the lost characters with any suitable fallback or
mnemonic scheme intended for human understanding
(4) Bounce the message/refuse the conversion/give up.
The action to be taken may be different based on the percentage of
"lost" characters.
If the message has "controls" like "conversion with loss prohibited",
only the last possibility may be used.
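A minimal sketch of options (1) and (4) combined follows, assuming a hypothetical loss threshold of 5 % (the memo does not give a figure) and a hypothetical function name. Characters outside the representable repertoire are replaced by the SUB control character unless too many would be lost or loss is prohibited, in which case the conversion is refused.

```python
SUB = "\x1a"  # the SUB control character


def substitute_lost(text, representable, max_loss=0.05, loss_prohibited=False):
    """Replace unrepresentable characters with SUB, or refuse the conversion."""
    lost = sum(1 for ch in text if ch not in representable)
    if lost == 0:
        return text
    if loss_prohibited or lost / len(text) > max_loss:
        # Option (4): bounce the message / refuse the conversion.
        raise ValueError(f"{lost} of {len(text)} characters cannot be represented")
    # Option (1): replace each lost character with SUB.
    return "".join(ch if ch in representable else SUB for ch in text)
```

The threshold is a local policy decision; a gateway might equally choose a mnemonic scheme for small losses and refuse only above some higher percentage.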
5.3. RECOMMENDED CHARACTER SETS
There are 2 steps in the algorithm above that are left for local
judgement:
(1) Selection of the sets to appear in PossibleSets.
(2) The algorithm for deciding which character set to select in
the final selection step (FINISHED).
In the context of generating X.400 GeneralText messages, the
following is recommended:
Sets in PossibleSets:
ISO-IR-6    Esc 28 42 (G0)  US-ASCII, IA5, ISO646
ISO-IR-100  Esc 2D 41 (G1)  ISO-8859-1 West Europe
ISO-IR-101  Esc 2D 42 (G1)  ISO-8859-2 Central/Eastern Europe
ISO-IR-144  Esc 2D 4C (G1)  ISO-8859-5 Cyrillic
ISO-IR-127  Esc 2D 47 (G1)  ISO-8859-6 Arabic
ISO-IR-126  Esc 2D 46 (G1)  ISO-8859-7 Greek
ISO-IR-138  Esc 2D 48 (G1)  ISO-8859-8 Hebrew
ISO-IR-148  Esc 2D 4D (G1)  ISO-8859-9 Turkish
The following multi-byte character sets are recommended:
ISO-IR-87   (Japanese JIS C6226-1983)  Esc 24 29 42 (G1)
ISO-IR-149  (Korean KS C 5601-1989)    Esc 24 29 43 (G1)
ISO-IR-58   (Chinese GB 2312-80)       Esc 24 29 41 (G1)
It is a STRONG recommendation that character sets not listed above,
which do not add any new characters to the total set of characters
given by the character sets above, should NOT be used in X.400
interchange.
ISO-IR-87 is the Japanese character set that is allowed in a Teletex
string, such as the subject field.
NOTE: ISO-IR-87 has been "superseded" by ISO-IR-168, which allows two
extra Kanji characters. Any application that handles ISO-IR-87 should
also be able to handle ISO-IR-168.
Algorithm for selecting character sets:
Start at the top of the list above, and add each set only if it is
needed.
5.4. Selecting a character set based on language
If the most common language of the environment in which it is used
is known, the following character sets are recommended.
The table of Latin-script languages is based on work by Johan van
Wingen <BUTPAA@rulmvs.leidenuniv.nl>. The others are best
guesses by the author.
The tables of character sets prepared by Keld Jorn Simonsen
<keld@dkuug.dk> [8] were invaluable in matching the data on
languages to the data on character sets.
Again, these are intended for guidance, not enforcement; there is
considerable prestige attached to such recommendations in other
contexts, and it is therefore likely that each language group will
make appropriate decisions on this subject. The table below is
intended as a compilation of existing knowledge, again on the
principle that it is better to say something than to say nothing.
The language codes (for those languages that have codes) come from
ISO 639 [4].
NOTE: ISO 639 is a very incomplete list of the world's languages
(perhaps 10 or 20 % according to some experts), and is undergoing
revision. The only reason for using it is that it is the only
ISO-standardized shorthand notation for languages available at the
moment.
Language 1 2 3 4 5
------------------------------------------------------------
sq Albanian X X X
eu Basque X X
br Breton X
hr Croatian X
cs Czech X
da Danish X
eo Esperanto X
fo Faeroese X
fi Finnish X X X
fy Frisian X
?? Gaelic X
gl Galician X X
de German X X
hu Hungarian X
is Icelandic X
ga Irish X X X
it Italian X
no Norwegian X X
pl Polish X
pt Portuguese X
?? Rhaetian X
ro Romanian X
sk Slovak X
sl Slovenian X
?? Sorbian X
es Spanish X X
sv Swedish X X
tr Turkish X
Explanation of character set codes
----------------------------------------
1: ISO_8859-1:1987
2: ISO_8859-2:1987
3: ISO_8859-9:1989
4: ISO_8859-supp
5: ISO_8859-2:1987 and ISO_8859-supp
Other languages for which appropriate character sets are known are
listed in the table below.
    Language      Character set
ar  Arabic        ISO-8859-6
be  Byelorussian  ISO-8859-5
bg  Bulgarian     ISO-8859-5
el  Greek         ISO-8859-7
en  English       US-ASCII
fa  Persian       ISO-8859-6
iw  Hebrew        ISO-8859-8
ja  Japanese      ISO-IR-87 (Japanese JIS C6226-1983)
ko  Korean        ISO-IR-149 (Korean KS C 5601-1989)
la  Latin         US-ASCII
lo  Laotian       ISO-IR-166
ru  Russian       ISO-8859-5
sw  Swahili       US-ASCII
th  Thai          ISO-IR-166
uk  Ukrainian     ISO-8859-5
ur  Urdu          ISO-8859-6
vo  Volapuk       ISO-8859-1
zh  Chinese       ISO-IR-58 (Chinese GB 2312-80)
Additional entries in this table are welcome!
Some languages have only one or a few characters missing. These
are listed below.
    Language      Character set  Missing
    Sami          ISO-8859-9     C with caron
                                 D with stroke
                                 I with diaeresis
                                 N with acute
                                 Eng
                                 S with caron
                                 T with stroke
                                 Z with caron
kl  Greenlandic   ISO-8859-1     I with tilde
                                 K with cedilla
                                 U with tilde
cy  Welsh         ISO-8859-1     W with acute
                                 W with grave
                                 W with diaeresis
                                 Y with grave
                                 Y with circumflex
nl  Dutch         ISO-8859-1     Ligature IJ
af  Afrikaans     ISO-8859-1     N preceded by apostrophe
fr  French        ISO-8859-1     Ligature OE
ca  Catalan       ISO-8859-1     L with middle dot
According to comments received, the "problem characters" for
Dutch, Afrikaans, French, Greenlandic and Catalan are not in
common use, or may be avoided by use of alternate spelling (like
using "ij" instead of the "Ligature IJ").
For French, Dutch, Catalan and Afrikaans, the character set ISO
6937-2, which uses floating diacritical marks, contains all
required characters.
The following languages can (to the author's limited knowledge) be
written with the current ISO 10646 standard, but with no other
registered character sets:
    Language          Country(ies)                     Script(s)
aa  Afar              Somalia, Ethiopia, Djibouti      Latin
ab  Abkhazian         Georgia                          Cyrillic
am  Amharic           Ethiopia                         Ethiopic
as  Assamese          India, Nepal                     Bengali
ay  Aymara            Bolivia, Peru, Chile             Latin
az  Azerbaijani       SNC, Iran, Iraq, Turkey          Cyrillic, Arabic
ba  Bashkir           SNC                              Cyrillic
bh  Bihari            India                            Gujarati (or Kaithi)
bi  Bislama           Vanuatu, New Caledonia           Latin
bn  Bengali           India                            Bengali
co  Corsican          France                           Latin
fj  Fiji              Fiji                             Latin
gd  Scots             UK                               Latin
gn  Guarani           Paraguay                         Latin
gu  Gujarati          India                            Gujarati
ha  Hausa             Nigeria, Niger, Chad, Sudan,...  Latin
hi  Hindi             India                            Devanagari
hy  Armenian          Armenia                          Armenian
ia  Interlingua       None (Artificial Language)       Latin
ie  Interlingue       None (Artificial Language)       Latin
ik  Inupiak           USA, Canada                      Latin, Cree
in  Indonesian        Indonesia                        Latin
ji  Yiddish           Germany, USA, SNC, Israel        Hebrew
jw  Javanese          Indonesia, Malaysia              Latin, Javanese
ka  Georgian          Georgia                          Georgian
kk  Kazakh            SNC, Afghanistan                 Cyrillic, Arabic
km  Cambodian         Cambodia                         Khmer
kn  Kannada           India                            Kannada
ks  Kashmiri          India, Pakistan                  Arabic
ku  Kurdish           SNC, Turkey, Iraq, Iran          Cyrillic, Arabic
ky  Kirghiz           SNC, China, Afghanistan          Cyrillic, Arabic
ln  Lingala           CAR, Congo, Zaire                Latin
mg  Malagasy          Madagascar, Comoro Islands       Latin, Arabic
mi  Maori             New Zealand                      Latin
mk  Macedonian        Greece, Yugoslavia               Greek, Cyrillic
ml  Malayalam         India                            Malayalam
mn  Mongolian         Mongolia                         Cyrillic, Mongolian
mo  Moldavian         Romania                          Latin
mr  Marathi           India                            Devanagari
ms  Malay             Malaysia, Thailand               Latin
my  Burmese           Myanmar                          Burmese
na  Nauru             Nauru                            Latin
ne  Nepali            Nepal                            Devanagari
oc  Occitan           France                           Latin
or  Oriya             India                            Oriya
pa  Punjabi           India                            Gurmukhi
ps  Pashto (Western)  Afghanistan, Iran                Arabic
qu  Quechua           Peru                             Latin
rm  Rhaeto            Switzerland                      Latin
rn  Kirundi           Burundi, Uganda                  Latin
rw  Kinyarwanda       Rwanda, Uganda, Zaire            Latin
sa  Sanskrit          India                            Devanagari
sd  Sindhi            Pakistan, India, Afghanistan     Arabic, Gurmukhi
sg  Sangro            Central African Republic         Latin
si  Singhalese        Sri Lanka                        Sinhalese
sm  Samoan            Samoa, USA, New Zealand          Latin
sn  Shona             Zimbabwe, Zambia, Mozambique     Latin
so  Somali            Somalia, Ethiopia, Djibouti      Latin
sr  Serbian           former Yugoslavia                Cyrillic
ss  Siswati           S. Africa, Swaziland             Latin
st  Sesotho           S. Africa, Lesotho               Latin
su  Sudanese          Sudan                            Latin
ta  Tamil             India, Malaysia                  Tamil
te  Telugu            India                            Telugu
tg  Tajik             Tajikistan                       Arabic
ti  Tigrinya          Ethiopia                         Latin, Ethiopic
tk  Turkmen           SNC, Iran, Afghanistan           Cyrillic, Arabic
tl  Tagalog           Philippines                      Latin
tn  Setswana          S. Africa, Botswana, Namibia     Latin
to  Tonga (3)         Mozambique                       Latin
ts  Tsonga            Mozambique, Swaziland            Latin
tt  Tatar             SNC                              Cyrillic
tw  Twi (Ewe)         Ghana                            Latin
uz  Uzbek (Southern)  Afghanistan, Turkey              Arabic
vi  Vietnamese        Vietnam, Cambodia, China         Latin
wo  Wolof             Senegal, Mauritania              Latin
xh  Xhosa             S. Africa                        Latin
yo  Yoruba            Nigeria, Togo, Benin             Latin
zu  Zulu              S. Africa, Lesotho, Malawi       Latin
The information about languages in ISO 10646 was kindly supplied
by Glenn Adams <glenn@metis.com>.
Languages for which the author does NOT know any proper character
set include:
bo  Tibetan
dz  Bhutani
et  Estonian
lt  Lithuanian
lv  Latvian, Lettish
mt  Maltese
sh  Serbo-Croatian
6. REFERENCES
[1] Information technology - ISO 8-bit code for information
interchange - Structure and rules for implementation, Third
edition, 1991-12-15.
[2] Information technology - 8-bit single-byte coded graphic
character sets (parts 1-11; the parts have different dates, the
ones referenced here are from RFC 1345).
[3] Information technology - Coded graphic character set for text
communication (parts 1 and 2; part 2 dated 1983-12-15).
[4] Code for the representation of names of languages. 1988 version.
[5] CCITT Recommendation X.209(1988): Specification of Basic
Encoding Rules for Abstract Syntax Notation One (ASN.1).
Technically aligned with ISO 8825 and ISO 8825/AD 1.
[6] Information Technology - Universal Multiple-Octet Coded
Character Set (UCS) - ISO 10646.
[7] Murai, J., Crispin M., and E. van der Poel, "Japanese Character
Encoding for Internet Message Bodies", RFC 1468, Keio
University, Panda Programming, June 1993.
[8] Simonsen, K., "Character Mnemonics & Character Sets", RFC 1345,
Rationel Almen Planlaegning, June 1992.
[9] Information Technology - Text communication - Message-Oriented
Text Interchange Systems (MOTIS) - ISO 10021 - October 1988.
[10] ISO DIS documents describing X.400/84 with slight extensions.
Now hard to get copies of, since they failed to become
ISes.
[11] CCITT Recommendation X.420 (1988), Interpersonal Messaging
System.
[12] International Standard--Information Processing-- ISO 7-bit and
8-bit coded character sets--Code extension techniques, ISO
2022:1986.
7. Security Considerations
Security issues are not discussed in this memo.
8. Author's Address
Harald Tveit Alvestrand
SINTEF DELAB
N-7034 Trondheim
NORWAY
EMail: Harald.Alvestrand@delab.sintef.no