Last Updated: 5 Oct 2004
Copyright (©) 2004, Innodata Isogen
Document Language: en
The Innodata Isogen Internationalization (I18N) Support Library is a collection of Java classes that provide fundamental services to document processors for localizing and internationalizing the rendered form of XML documents.
The services provided include:
Language-specific comparators for language- and locale-appropriate lexical sorting of strings (for example, with the xsl:sort command through Saxon). The generic "getComparator" functions can be bound to any implementation of the Java Comparator interface. The default Comparator implementation is the one provided by the ICU4J package (http://oss.software.ibm.com/icu4j/).
The core functions (I18nService) are processor independent and can be bound to any specific processor through a relatively thin binding layer, as demonstrated by the provided Saxoni18nService class. For example, the I18nService can be bound to Epic Editor through its Java API, to other Java-based XSLT processors, to Java-based user interfaces, or to DOM-based XML processors.
The I18N Support Library uses two configuration files, one for static text and one for index configuration. Both are XML documents. As far as the core library is concerned, these files can be anywhere. However, the Saxon extension class requires that the files be in specific locations relative to the root of the "i18n home" directory (which is set using the "com.innodata.i18n.home" Java system property).
For the Saxon extensions, the configuration files must be in the following directories:
This restriction is a side effect of the fact that there's no direct way to pass parameters to the Saxon extension library (except through Java system properties set on the Java command line). If more flexibility is needed, it would be possible to define additional system properties for specifying the exact locations of these configuration files.
The static text database document consists of two main parts: the "contexts" and the "attribute maps". The contexts are primarily intended to map element types to the text generated before them and, if needed, the text generated after them. However, the contexts can include entries with arbitrary string keys, for example, for strings that have no associated element type. The attribute maps map values of enumerated attributes to specific strings.
The static text database configuration vocabulary is bound to the XML namespace URI "http://www.innodata-isogen.com/vocabularies/i18n_support/static_text_database".
The <contexts_common> element contains the context entries, consisting of one or more <context> elements.
Each <context> element has a <lookup_key>, which contains the string by which the context is looked up. This can be anything, but values that are the same as element type names can be accessed using the getGeneratedText functions that take an element as one of their arguments. By convention, keys that are not element type names are prefixed with "#" to ensure that they do not conflict with any element type names (XML names cannot start with "#").
Following <lookup_key> is one <text_before> and one <text_after> element. Each of these is either empty or has a <default_item> element and zero or more <item> elements.
The <default_item> element defines the default value to be used when there is no item for a specific language. This can either be a useful value or a string like "{toc not translated}", which provides a clear visual indicator of a missing translation.
Each <item> element provides the translation for a single language, specified using the xml:lang= attribute.
A typical context element is:
    <context>
      <lookup_key>#full_stop</lookup_key>
      <text_before>
        <default_item>.</default_item>
        <item xml:lang="zh-CN">。</item>
        <item xml:lang="zh-HK">。</item>
        <item xml:lang="zh-TW">。</item>
      </text_before>
      <text_after/>
    </context>
This example defines the character to use for full stop (period) in various languages. This might be used in constructing cross reference strings, for example.
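As a further illustration of the placeholder convention for <default_item> described above, the following sketch shows a context whose default value flags a missing translation. The "#toc" key and the "{toc not translated}" placeholder come from this document; the two translations are illustrative values only and are not taken from the shipped static text database.

```xml
<context>
  <lookup_key>#toc</lookup_key>
  <text_before>
    <!-- Placeholder default: appears in the output for any language
         that has no <item> below, making the gap easy to spot -->
    <default_item>{toc not translated}</default_item>
    <!-- Illustrative translations (assumed, not from the distribution) -->
    <item xml:lang="en">Table of Contents</item>
    <item xml:lang="de">Inhaltsverzeichnis</item>
  </text_before>
  <text_after/>
</context>
```

With an entry like this, a lookup for a language with no matching <item> would be expected to return the placeholder rather than an empty string.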
The back-of-the-book (botb) index rules configuration file lets you define the alphabetic groups for each language, as well as defining the collation (sorting) rules for the language, if necessary. Grouping rules can be defined by enumerating each character or character sequence for each group or, for languages with lots of characters, such as ideographic languages, you can define groups by specifying the first member of each group (and the last member of the last group).
The back-of-the-book index configuration vocabulary is bound to the XML namespace URI "http://www.innodata-isogen.com/vocabularies/i18n_support/botb_index_config".
The element types involved are:
botb_index_rules
metadata
index_config
national_language
description
collation_spec
sort_method
group_definitions
    Contains the group definitions (term_group). Each group must have at least a group key. If the sort method is "group by members", it must also contain an explicit list of group member characters (group_members). Groups can also have a group label that is different from the group key and, if necessary, a group sort key that is different from either the group label or the group key. If only the group key is specified, it is also used as the group label and group sort key. If a group label is specified, it is used as the group sort key if no explicit sort key is defined. Note that any character that does not sort into one of the defined groups will be grouped into the "Symbol/Numeric" group (group key "#NUMERIC").
term_group
group_key
group_label
group_sort_key
group_members
    Contains char_or_seq elements to enumerate the characters within the group. The group_members element should not be used if the sort method is "sort between keys", except for the last group, which must specify the last_member element to indicate the last member of the last group.
char_or_seq
    For a language such as English, each char_or_seq element would contain one character, one for each lowercase and uppercase letter. For languages like Spanish, where two or more characters are treated as a single character for sorting and grouping, you would specify multiple characters within a single char_or_seq element, e.g. <char_or_seq>ch</char_or_seq>.
last_member
    Within group_members, identifies the last member of the last group for indexes that use the "sort between keys" sort method (e.g., the ideographic languages).
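The fallback rules for group key, group label, and group sort key can be sketched with two groups. This is only an illustration: the child element order is assumed to follow the order in which the element types are listed above, and the label and sort key values in the second group are hypothetical.

```xml
<!-- Key only: "A" serves as the group key, the group label,
     and the group sort key -->
<term_group>
  <group_key>A</group_key>
  <group_members>
    <char_or_seq>a</char_or_seq>
    <char_or_seq>A</char_or_seq>
  </group_members>
</term_group>

<!-- Key, label, and explicit sort key (hypothetical values):
     the label is what is displayed, the sort key positions the group.
     Without <group_sort_key>, the label would be used for sorting. -->
<term_group>
  <group_key>CH</group_key>
  <group_label>Ch</group_label>
  <group_sort_key>CH</group_sort_key>
  <group_members>
    <char_or_seq>ch</char_or_seq>
    <char_or_seq>CH</char_or_seq>
  </group_members>
</term_group>
```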
The sample index configuration document provides examples of index configurations for alphabetic, syllabic (Korean), and ideographic languages, showing how to configure each type of language. The configurations for these languages are discussed in more detail below.
NOTE: The index configuration mechanism has been implemented to use a single XML document instance to hold the configurations for all the languages needed. If you find it convenient to put each language's configuration in a separate file, you can use normal XML external parsed entities to do this. While it hasn't been done, it would not be difficult to implement an XInclude-style inclusion mechanism if there is a strong requirement for it.
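A minimal sketch of the external parsed entity approach follows. It assumes that botb_index_rules is the document element and that the vocabulary namespace is declared as the default namespace; the entity names and file names are hypothetical, and each referenced file is assumed to contain a single index_config element with no DOCTYPE declaration of its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE botb_index_rules [
  <!-- Hypothetical per-language files, resolved relative to this document -->
  <!ENTITY index_config_en SYSTEM "index_config_en.xml">
  <!ENTITY index_config_es SYSTEM "index_config_es.xml">
]>
<botb_index_rules
    xmlns="http://www.innodata-isogen.com/vocabularies/i18n_support/botb_index_config">
  <!-- Other children (for example, metadata) omitted from this sketch -->
  &index_config_en;
  &index_config_es;
</botb_index_rules>
```

Because the parser expands the entity references before the library sees the document, the result is still a single document instance as far as the index configuration code is concerned.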
The English index configuration is the simplest configuration, as it requires nothing more than a set of groups, each consisting of two single-character char_or_seq elements, one for the lowercase form of a letter, one for the uppercase form. There is no special collation specification or sorting method. The English index configuration must always be present and is used as the fallback configuration for any language for which no explicit configuration is found and for grouping and sorting English words (the current code base assumes that words not in the document's base national language will be in English; that is, the current code base does not provide for a Chinese document that contains Spanish words that need to sort according to the Spanish index rules).
The English index configuration can be used as the base for any other Latin-based language: just copy the index_config element, change the national language value, and adjust the groups as necessary.
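As a sketch of what "adjusting the groups" might look like, the fragment below shows an English-style group for "A" followed by an added group for "Ñ", as a Spanish configuration copied from the English one might define it. The structure mirrors the term_group markup used elsewhere in this document; the "Ñ" group itself is an illustrative assumption and is not taken from the sample configuration.

```xml
<!-- Group carried over unchanged from the English configuration -->
<term_group>
  <group_key>A</group_key>
  <group_members>
    <char_or_seq>a</char_or_seq>
    <char_or_seq>A</char_or_seq>
  </group_members>
</term_group>

<!-- Group added for the copied language (illustrative) -->
<term_group>
  <group_key>Ñ</group_key>
  <group_members>
    <char_or_seq>ñ</char_or_seq>
    <char_or_seq>Ñ</char_or_seq>
  </group_members>
</term_group>
```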
The Spanish index configuration demonstrates using char_or_seq to define a group as having a multi-character sequence as a member. In Spanish, "ch" is treated as a single character for the purposes of grouping and sorting, so the Spanish configuration differs from the English in having this additional entry:
    <term_group>
      <group_key>CH</group_key>
      <group_members>
        <char_or_seq>ch</char_or_seq>
        <char_or_seq>CH</char_or_seq>
      </group_members>
    </term_group>
Note that it is not necessary to define all the possible case combinations of the character group (e.g., "Ch", "cH"), just the all lowercase and all uppercase versions.
For grouping and sorting, this definition causes all words starting with "ch" to be grouped and sorted after all words that start with "c" followed by any character other than "h".
Note also that this treatment of "ch" must be defined in the Java collation rules for the language. In the case of Spanish (and all or most other European and Eastern European languages), the appropriate collation rules are provided by the standard Java distribution.
The Simplified Chinese index configuration demonstrates several features. Simplified Chinese, as an ideographic language, uses at least 40,000 characters, grouped and sorted alphabetically according to their Pin-Yin transliteration. For example, the character for "horse" is transliterated as "ma" (ignoring tone indicators) in Pin-Yin. Thus, words starting with this character will be grouped under "M" and sorted before any character that transliterates as "mi".
Because of the large number of characters, it would be impractical (but not impossible) and inefficient to enumerate the members of each group. Instead, Chinese (and all the other ideographic languages) use the "sort between keys" sort strategy, as indicated by the <sort_between_keys> element within the <sort_method> element.
In addition, the editorial style for Simplified Chinese is that English words sort before Chinese words, so that the English word "math" would sort before all Chinese characters within the "M" group. This is indicated by the <sort_english_before> element within <sort_method>. In most non-Latin languages, English words are sorted after the words in the main language, so that is the default.
Each group has a group key, which is the first Chinese character within that group, and a group label, which is the Latin character label for that group ("A", "B", "C", etc.). Because the group key is used as the group sort key by default, there is no need to specify a separate group sort key.
Each group has a <group_members> element, but it is empty for all but the last group. For the last group, the <group_members> element contains a <last_member> element that contains the last Chinese character member of the last group. Without this specification, any characters that are defined as sorting after the ideographs would also be sorted into the last group.
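The pieces described for Simplified Chinese can be put together in a sketch like the one below. Several things here are assumptions: that sort_method and group_definitions are children of index_config, that the two sort-method flags are empty elements, and above all the specific Chinese characters, which are placeholders chosen for illustration and have not been checked against the ordering of any particular collator.

```xml
<index_config>
  <!-- Other children (national language, description, and so on) omitted -->
  <sort_method>
    <!-- Groups are delimited by their first member, not enumerated -->
    <sort_between_keys/>
    <!-- English words sort ahead of Chinese words within a group -->
    <sort_english_before/>
  </sort_method>
  <group_definitions>
    <term_group>
      <group_key>阿</group_key>      <!-- placeholder: first character of group "A" -->
      <group_label>A</group_label>
      <group_members/>
    </term_group>
    <term_group>
      <group_key>八</group_key>      <!-- placeholder: first character of group "B" -->
      <group_label>B</group_label>
      <group_members/>
    </term_group>
    <!-- Groups "C" through "Y" omitted -->
    <term_group>
      <group_key>匝</group_key>      <!-- placeholder: first character of group "Z" -->
      <group_label>Z</group_label>
      <group_members>
        <!-- Closes the last group so later characters are not swept into it -->
        <last_member>座</last_member> <!-- placeholder: last character of group "Z" -->
      </group_members>
    </term_group>
  </group_definitions>
</index_config>
```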
The I18N library depends on the use of ICU4J 3.2 (or later) collators for all languages. These collators appear to provide appropriate collation rules for all languages. Note that while the general index configuration mechanism provides a way to specify the use of local collation rules, the current version of the library does not support such rules with the ICU4J collators. This may be fixed at some point in the future (it's simply a matter of refactoring the ICU4J collator factory).
Traditional Chinese indexes are sorted and grouped by character stroke count and then by radical (the base graphical element within a character). The group labels are the characters for "one-stroke characters", "two-stroke characters", and so on. Thus, where for Simplified Chinese the group label and sort key are the same, here the group label and sort key are different. The sort key is the same as the group key so there is no need to specify a separate group sort key.
To install the I18N Support library, simply unpack the package, creating the subdirectories. The i18n_support.jar file includes a manifest that automatically adds the 3rd-party libraries in the lib/ directory to the Java class path. As long as the relative relationship is maintained, you do not need to set or extend the Java CLASSPATH environment variable or command-line parameter to include the 3rd-party jars, only the i18n_support.jar itself.
The configuration files can be in any location, although the Saxon extension class (Saxoni18nService) expects them to be in config/ below the root of the distribution (the com.innodata.xml.i18nhome Java system property). If you change the organization of the configuration files you must update the Java source to reflect those changes.
To use the Saxon extensions you must declare a namespace prefix to use for the extension functions and bind it to the Saxoni18nService class, e.g.:
    <xsl:stylesheet
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:isoi18n="java:com.isogen.saxoni18n.Saxoni18nService"
    >
You can then use the static methods defined in the Saxoni18nService class as XSLT extension functions, e.g.:
<xsl:value-of select="isoi18n:getGeneratedTextForKeyBefore('#toc', $currentLang)"/>
See the Java API docs for the details of the extension functions provided.
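Putting the namespace declaration and the function call together, a small stylesheet might look like the following sketch. The namespace binding and the getGeneratedTextForKeyBefore call are taken from the examples above; the version attribute, the currentLang parameter with its default value, the "toc" element type, and the assumption that the second argument is a plain language-code string are all illustrative and should be checked against the Java API docs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:isoi18n="java:com.isogen.saxoni18n.Saxoni18nService"
    exclude-result-prefixes="isoi18n">

  <!-- Hypothetical parameter: language of the document being rendered -->
  <xsl:param name="currentLang" select="'en'"/>

  <!-- Hypothetical element type "toc": emit the localized heading text
       stored under the "#toc" key, then process the children -->
  <xsl:template match="toc">
    <h1>
      <xsl:value-of
          select="isoi18n:getGeneratedTextForKeyBefore('#toc', $currentLang)"/>
    </h1>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Running this requires the i18n home system property to be set on the Java command line, as described earlier, so that the extension class can locate its configuration files.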