APERTIUM: AN OPEN-SOURCE SHALLOW-TRANSFER MACHINE TRANSLATION ENGINE FOR RELATED-LANGUAGE PAIRS
Carme Armentano-Oller, Rafael C. Carrasco, Boyan I. Bonev, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez
Transducens group, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain).
carmentano [at] dlsi.ua.es, carrasco [at] dlsi.ua.es, bib [at] alu.ua.es, acorbi [at] dlsi.ua.es, mlf [at] ua.es, mginesti [at] dlsi.ua.es, sortiz [at] dlsi.ua.es, japerez [at] dlsi.ua.es, gema [at] internostrum.com, fsanchez [at] dlsi.ua.es
January 24, 2006
This documentation is distributed under the GNU General Public License (http://www.gnu.org/licenses/gpl.html)
Abstract. We briefly describe Apertium: an open-source shallow-transfer machine translation engine, initially aimed at related-language pairs. Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state based chunking for structural transfer; its architecture is largely based upon that of systems already developed by the Transducens group at the Universitat d'Alacant, such as interNOSTRUM (Spanish-Catalan, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, http://traductor.universia.net). It will be possible to use Apertium to build machine translation systems for a variety of related-language pairs; to that end, the project proposes simple standard formats to encode the linguistic data needed. This paper briefly describes the machine translation engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine.
1. Introduction
[This document is largely based on the paper "An Open-Source Shallow-Transfer Machine Translation Engine for the Romance Languages of Spain", presented by Antonio M. Corbí-Bellot, Mikel L. Forcada, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Iñaki Alegria, Aingeru Mayor, and Kepa Sarasola at the 10th Conference of the European Association for Machine Translation (Budapest, May 30-31, 2005).]
This document describes Apertium: an open-source shallow-transfer machine translation (MT) engine, initially aimed at related-language pairs. The shallow-transfer architecture is suitable for pairs of closely related languages: Romance language pairs such as Spanish-Catalan or Spanish-Portuguese, and other language pairs such as Czech-Slovak, Danish-Swedish, Kirwanda-Kiswahili, etc.
Existing MT programs are mostly commercial or use proprietary technologies, which makes them very hard to adapt to new usages, and may use different technologies across language pairs, which makes it very difficult to integrate them in a single multilingual content management system.
The MT architecture proposed here uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state based chunking for structural transfer, and is largely based upon that of systems already developed by the Transducens group such as interNOSTRUM (Spanish-Catalan, Canals-Marote et al. 2001, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, Garrido-Alenda et al. 2003, http://traductor.universia.net); these systems are publicly accessible through the net and used on a daily basis by thousands of users.
The MT engine and toolbox have been released in two packages: lttoolbox (containing all the lexical processing modules and tools) and apertium itself (containing the rest of the engine) under an open-source license (GPL). In addition to the toolbox, open-source data are available for three language pairs: Spanish-Catalan (apertium-es-ca) and Spanish-Galician (apertium-es-gl), developed under the OpenTrad consortium, and Spanish-Portuguese (apertium-es-pt), developed at the University of Alacant. This means that anyone having the necessary computational and linguistic skills will be able to adapt or enhance it to produce a new MT system, even for other pairs of related languages. The first version of the whole system, together with linguistic data for the Spanish-Catalan language pair, was released on July 29, 2005. The description in this document applies to version 1.0 of apertium and version 1.0 of lttoolbox.
Web prototypes for all three pairs may also be tested on plain texts, RTF and HTML documents and websites at the address www.apertium.org.
We expect that the introduction of a unified open-source MT architecture will ease some of the mentioned problems (having different technologies for different pairs, closed-source architectures being hard to adapt to new uses, etc.). It will also help shift the current business model from a licence-centred one to a services-centred one, and favour the interchange of existing linguistic data through the use of the XML-based formats defined in this project.
The following sections give an overview of the architecture (sec. 2), the formats defined for the encoding of linguistic data (sec. 3), and the compilers used to convert these data into an executable form (sec. 4); finally, we give some concluding remarks (sec. 5).
2. The Apertium MT architecture
The MT strategy used in Apertium has already been described in detail (Canals-Marote et al. 2001; Garrido-Alenda et al. 2003); a sketch will be given here. The engine is a classical shallow-transfer or transformer system consisting of an 8-module assembly line; we have found that this strategy is sufficient to achieve a reasonable translation quality between related languages such as Spanish (es), Catalan (ca), Galician (gl) or Portuguese (pt). While, for these languages, a rudimentary word-for-word MT model may give an adequate translation for about 75% of the text (measured as the percentage of words in a text that do not need correction), the addition of homograph disambiguation, management of contiguous multi-word units, and local reordering and agreement rules may raise the fraction of adequately translated text above 90%. This is the approach used in the engine presented here, and we expect it to be useful for other related-language pairs.
To ease diagnosis and independent testing, modules communicate with one another using text streams (the examples below give an idea of the communication format used). This allows some of the modules to be used in isolation, independently of the rest of the MT system, for other natural-language processing tasks. The apertium package includes a shell script, apertium-translator, which calls all modules as necessary for a given language pair and a given text format.
The modules are organized as in the diagram in Figure 1 at the end of this document. Most of the modules are capable of processing tens of thousands of words per second on current desktop workstations; only the structural transfer module lags behind at several thousands of words per second. The following sections describe each module of the shallow-transfer architecture in detail.
2.1. The de-formatter
The de-formatter (generated automatically from a formatting specification file, see 3.4) separates the text to be translated from the format information (RTF, HTML, etc.). Format information is encapsulated so that the rest of the modules treat it as blanks between words. For example, the HTML text in Spanish:
vi <em>una señal</em>
("I saw a signal") would be processed by the de-formatter so that it would encapsulate the HTML tags between brackets and deliver
vi[ <em>]una señal[</em>]
The character sequences in brackets are treated as simple blanks between words by the rest of the modules. As usual, the escape symbol \ is used before the symbols [ and ] if present in the text.
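The encapsulation step can be illustrated with a minimal Python sketch. It is not the generated de-formatter itself: the regular expression, the function name deformat_html and the decision to absorb whitespace next to a tag into the bracketed sequence are assumptions made here to reproduce the example above.

```python
import re

# Assumption: a run of HTML tags, plus any whitespace around them,
# forms one block of format information.
TAG_RUN = re.compile(r"(\s*(?:<[^>]+>\s*)+)")

def deformat_html(text):
    """Wrap format information in [...] and escape brackets in the text."""
    pieces = TAG_RUN.split(text)  # even indexes: text, odd indexes: tag runs
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            out.append("[" + piece + "]")
        else:
            # Escape literal brackets so they are not taken for format.
            out.append(re.sub(r"([\[\]])", r"\\\1", piece))
    return "".join(out)

print(deformat_html("vi <em>una señal</em>"))  # vi[ <em>]una señal[</em>]
```

Splitting on the tag runs first means that only the translatable text is escaped, while the format information is passed through untouched inside the brackets.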
2.2. The morphological analyser
The morphological analyser (program lt-proc in package lttoolbox with option -a) tokenizes the text in surface forms (lexical units as they appear in texts) and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. Tokenization is not straightforward due to the existence, on the one hand, of contractions, and, on the other hand, of multi-word lexical units. For contractions, the system reads in a single surface form and delivers the corresponding sequence of lexical forms (for instance, the es preposition-article contraction del would be analysed into two lexical forms, one for the preposition de and another one for the article el). Multi-word surface forms are analysed in a left-to-right, longest-match fashion; for instance, the analysis for the es preposition a would not be delivered when the input text is a través de ("through"), which is a multi-word preposition in es. Multi-word surface forms may be invariable (such as a multi-word preposition or conjunction) or inflected (for example, in es, echaban de menos, "they missed", is a form of the imperfect indicative tense of the verb echar de menos, "to miss"). Apertium offers support for many types of inflected multi-word units. The module reads in a binary file compiled from a source-language morphological dictionary (see section 3.1).
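The left-to-right, longest-match policy and the treatment of contractions can be sketched in Python as follows. This is a word-level toy under stated assumptions: the real analyser works with compiled finite-state transducers, and the lexicon entries and the tags pr and def used here are hypothetical, chosen only for illustration.

```python
# Hypothetical toy lexicon: surface form -> sequence of lexical forms.
LEXICON = {
    "a": ["a<pr>"],
    "a través de": ["a través de<pr>"],
    "del": ["de<pr>", "el<det><def><m><sg>"],  # contraction: two lexical forms
}

def tokenize(text, lexicon=LEXICON):
    """Split text into surface forms, preferring the longest known match."""
    words = text.split()
    longest = max(len(key.split()) for key in lexicon)
    units, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lexicon:
                units.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            units.append((words[i], []))  # unknown word: no analysis
            i += 1
    return units

print(tokenize("a través de"))
print(tokenize("a del"))
```

Because longer candidates are tried first, the single-word analysis of a is never delivered when a través de is present in the input.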
Upon receiving the example text in the previous section, the morphological analyser would deliver
^vi/ver<vblex><ifi><1><sg>$[ <em>]^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ ^señal/señal<n><f><sg>$[</em>]
where each surface form is analysed into one or more lexical forms. For example, vi is analysed into lemma ver, lexical category lexical verb (vblex), indefinite indicative (ifi), 1st person, singular, whereas una (a homograph) receives three analyses: un, determiner, indefinite, feminine singular, and two forms of the present subjunctive (prs) of the verb unir ("to join"). The characters "^" and "$" delimit the analyses for each surface form; lexical forms for each surface form are separated by "/"; angle brackets "<...>" are used to delimit grammatical symbols. The string after the "^" and before the first "/" is the surface form as it appears in the source input text.
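To make the stream format concrete, the following Python sketch reads the analyser output into surface forms and lexical forms. It is only an illustration: the function name parse_analyses is hypothetical, and escaped delimiters inside the text are deliberately not handled.

```python
import re

UNIT = re.compile(r"\^(.*?)\$", re.S)  # one ^...$ unit per surface form
TAG = re.compile(r"<([^>]*)>")         # grammatical symbols in <...>

def parse_analyses(stream):
    """Return a list of (surface form, [(lemma, [symbols]), ...])."""
    units = []
    for body in UNIT.findall(stream):
        surface, *readings = body.split("/")
        parsed = []
        for reading in readings:
            lemma = reading.split("<", 1)[0]
            parsed.append((lemma, TAG.findall(reading)))
        units.append((surface, parsed))
    return units

stream = ("^vi/ver<vblex><ifi><1><sg>$[ <em>]"
          "^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>"
          "/unir<vblex><prs><3><sg>$ ^señal/señal<n><f><sg>$[</em>]")
for surface, readings in parse_analyses(stream):
    print(surface, readings)
```

Format information between units, such as [ <em>], is simply skipped here, which mirrors the way the other modules treat it as blanks.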
As noted above, multi-word surface forms may be invariable or inflected; a Portuguese example of the latter is tinham saudades ("they missed"), a form of the imperfect indicative tense of the verb ter saudades ("to miss").
2.3. The part-of-speech tagger
As has been shown in the previous example, some surface forms (about 30% in Romance languages) are homographs, ambiguous forms for which the morphological analyser delivers more than one lexical form. The part-of-speech tagger chooses one of them, according to the lexical forms of neighbouring words. When translating between related languages, incorrectly resolved homographs are one of the main sources of errors.
The part-of-speech tagger is trained from a tagger definition file (see section 3.2) and corpus data. The tagger has options --tagger for tagging (during machine translation), and --train and --supervised for unsupervised and supervised training respectively (offline, when building the machine translation system). The result of training is a file containing a hidden Markov model (HMM) which has been obtained from representative source-language texts (using an open-source training program). This file also contains the patterns that define the behaviour of the tagger and information about the ambiguity classes present during training. Two training modes are possible: the unsupervised one uses a large amount (millions of words) of untagged text processed by the morphological analyser, and the supervised one uses a small amount of tagged text (tens of thousands of words) where a lexical form for each homograph has been manually selected. The second method usually leads to slightly better performance (about 96% correct part-of-speech tags, considering both homographs and non-homographs). We are currently building a collection of open corpora (both untagged and tagged) using texts published on the web under Creative Commons licenses.
The result of processing the example text delivered by the morphological analyser with the part-of-speech tagger would be
^ver<vblex><ifi><1><sg>$[ <em>]^un<det><ind><f><sg>$ ^señal<n><f><sg>$[</em>]
where the correct lexical form (determiner) has been selected for the word una.
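How an HMM can select one reading per word may be sketched with a small Viterbi search in Python. This is a simplification under stated assumptions: it scores tag sequences with transition probabilities only (the emission probabilities of ambiguity classes are left out), and all probability values below are invented for illustration, not taken from a trained Apertium model.

```python
def viterbi(candidates, start_p, trans_p):
    """Pick one tag per word, maximising the product of transition scores.

    candidates: one list of possible tags per word (its ambiguity class).
    """
    # best[tag] = (score, path) for the best path ending in that tag
    best = {tag: (start_p.get(tag, 0.0), [tag]) for tag in candidates[0]}
    for tags in candidates[1:]:
        step = {}
        for tag in tags:
            step[tag] = max(
                ((score * trans_p.get((prev, tag), 0.0), path + [tag])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Invented toy parameters (assumptions, not trained values).
START = {"vblex": 0.3, "det": 0.4, "n": 0.3}
TRANS = {("vblex", "det"): 0.5, ("vblex", "vblex"): 0.1,
         ("det", "n"): 0.8, ("vblex", "n"): 0.2}

# vi: verb only; una: determiner or verb; señal: noun only.
print(viterbi([["vblex"], ["det", "vblex"], ["n"]], START, TRANS))
# ['vblex', 'det', 'n']
```

With these toy values the verb-determiner-noun path scores 0.3 x 0.5 x 0.8 = 0.12 against 0.006 for the verb-verb-noun path, so the determiner reading of una is kept.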
2.4. The lexical transfer module
The lexical transfer module, which is implemented inside the lttoolbox library, is called by the structural transfer module (see next section); it reads each source-language lexical form and delivers a corresponding target-language lexical form. The module reads in a binary file compiled from a bilingual dictionary (see section 3.1). The dictionary contains a single equivalent for each source-language entry; that is, no word-sense disambiguation is performed. For some words, multi-word entries are used to safely select the correct equivalent in frequently-occurring fixed contexts. This approach has been used with very good results in Traductor Universia and interNOSTRUM.
Each of the lexical forms in the running example would be translated into Catalan as follows:
ver<vblex> ---> veure<vblex>
un<det> ---> un<det>
señal<n><f> ---> senyal<n><m>
where the remaining grammatical symbols for each lexical form would be simply copied to the target-language output. Note the gender change to masculine when translating señal into Catalan senyal.
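The lookup-and-copy behaviour described above can be sketched in Python. The table BIDIX and the function name transfer are hypothetical; a plain dictionary stands in for the compiled bilingual dictionary, using only the three entries of the running example.

```python
# Hypothetical stand-in for the compiled bilingual dictionary:
# (lemma, leading symbols) -> (target lemma, target leading symbols)
BIDIX = {
    ("ver", ("vblex",)): ("veure", ("vblex",)),
    ("un", ("det",)): ("un", ("det",)),
    ("señal", ("n", "f")): ("senyal", ("n", "m")),
}

def transfer(lemma, tags):
    """Translate a lexical form; symbols not covered by the entry are copied."""
    for n in range(len(tags), 0, -1):  # prefer the most specific entry
        key = (lemma, tuple(tags[:n]))
        if key in BIDIX:
            target_lemma, target_tags = BIDIX[key]
            return target_lemma, list(target_tags) + list(tags[n:])
    return lemma, list(tags)  # unknown entry: passed through unchanged

print(transfer("señal", ["n", "f", "sg"]))  # ('senyal', ['n', 'm', 'sg'])
```

Each source entry has exactly one equivalent, in line with the absence of word-sense disambiguation in the module.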
2.5. The structural transfer module
From the release of Apertium 1.0 on, a generic structural transfer module interprets a slightly preprocessed version of the structural transfer specification file (see 3.3); it uses finite-state pattern matching to detect (in the usual left-to-right, longest-match way) fixed-length patterns of lexical forms (chunks or phrases) needing special processing due to grammatical divergences between the two languages (gender and number changes to ensure agreement in the target language, word reorderings, lexical changes such as changes in prepositions, etc.) and performs the corresponding transformations.
Optionally, the module may be compiled from the structural transfer specification file to increase slightly the translation speed, but in this case each language pair would have a different structural transfer module (this was the usual situation until the release of Apertium 1.0).
In the running example, a determiner-noun rule is used to change the gender of the determiner so that it agrees with the noun; the result is
^veure<vblex><ifi><1><sg>$[ <em>]^un<det><ind><m><sg>$ ^senyal<n><m><sg>$[</em>]
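A determiner-noun agreement rule of this kind can be sketched in Python as a pattern plus an action. This is an illustration only: the real module interprets XML rule files, and the function names and the symbol inventories GENDERS and NUMBERS are assumptions (the symbol pl in particular does not appear in the examples of this document).

```python
GENDERS = {"m", "f"}
NUMBERS = {"sg", "pl"}  # "pl" is an assumed symbol for plural

def pick(tags, inventory):
    """Return the first symbol of tags that belongs to inventory, if any."""
    return next((tag for tag in tags if tag in inventory), None)

def det_noun_agreement(units):
    """Pattern: determiner followed by noun. Action: copy gender and number."""
    result = [(lemma, list(tags)) for lemma, tags in units]
    for i in range(len(result) - 1):
        det_lemma, det_tags = result[i]
        _, noun_tags = result[i + 1]
        if det_tags[:1] == ["det"] and noun_tags[:1] == ["n"]:
            gender = pick(noun_tags, GENDERS)
            number = pick(noun_tags, NUMBERS)
            fixed = []
            for tag in det_tags:
                if tag in GENDERS and gender:
                    fixed.append(gender)
                elif tag in NUMBERS and number:
                    fixed.append(number)
                else:
                    fixed.append(tag)
            result[i] = (det_lemma, fixed)
    return result

units = [("veure", ["vblex", "ifi", "1", "sg"]),
         ("un", ["det", "ind", "f", "sg"]),
         ("senyal", ["n", "m", "sg"])]
print(det_noun_agreement(units)[1])  # ('un', ['det', 'ind', 'm', 'sg'])
```

The verb is left untouched because it is not part of any determiner-noun pattern.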
2.6. The morphological generator
The morphological generator (program lt-proc in package lttoolbox with option -g) delivers a target-language surface form for each target-language lexical form, by suitably inflecting it. The module reads in a binary file compiled from a target-language morphological dictionary (see section 3.1). The result for the running example would be
vaig veure[ <em>]un senyal[</em>]
2.7. The post-generator
The post-generator (program lt-proc in package lttoolbox with option -p) performs orthographical operations such as contractions and apostrophations. The module reads in a binary file compiled from a rule file expressed as a dictionary (section 3.1). The post-generator is usually dormant (just copies the input to the output) until a special alarm symbol contained in some target-language surface forms wakes it up to perform a particular string transformation if necessary; then it goes back to sleep.
For example, in Catalan, clitic pronouns in contact may change before a verb: em ("to me") and ho ("it") contract into m'ho, em and els ("them") contract into me'ls and em and la ("her") are written me la. To signal these changes, linguists prepend an alarm to the target-language surface form "em" in target-language dictionaries and write post-generation rules to ensure the changes described.
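The dormant and awake behaviour can be sketched in Python using the three Catalan rules just mentioned. The character chosen as alarm symbol ("~"), the rule table and the word-boundary check are assumptions made for this sketch; the actual module reads its rules from a compiled post-generation dictionary.

```python
ALARM = "~"  # assumed alarm symbol, for illustration only
RULES = {"em ho": "m'ho", "em els": "me'ls", "em la": "me la"}

def postgenerate(text):
    """Copy the input until an alarm is found, then try the rules once."""
    out, i = [], 0
    while i < len(text):
        if text[i] != ALARM:
            out.append(text[i])  # dormant: copy input to output
            i += 1
            continue
        rest = text[i + 1:]
        for source, target in RULES.items():
            end = len(source)
            # The match must end at a word boundary.
            if rest.startswith(source) and (len(rest) == end
                                            or not rest[end].isalpha()):
                out.append(target)
                i += 1 + end
                break
        else:
            i += 1  # no rule applies: drop the alarm and go back to sleep
    return "".join(out)

print(postgenerate("~em ho dius"))  # m'ho dius
print(postgenerate("~em dius"))     # em dius
```

Text without an alarm symbol is never inspected against the rules, which keeps the common case cheap.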
2.8. The re-formatter
Finally, the re-formatter restores the format information encapsulated by the de-formatter into the translated text and removes the encapsulation sequences used to protect certain characters in the source text. The result for the running example would be the correct translation of the HTML text:
vaig veure <em>un senyal</em>
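As a rough illustration of this last step, the following Python sketch removes the bracket encapsulation and the escape symbols. The function name reformat is hypothetical, and the sketch assumes that the only escape mechanism is a backslash before the protected character.

```python
def reformat(text):
    """Restore format information and remove escape symbols."""
    out, i = [], 0
    while i < len(text):
        char = text[i]
        if char == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # escaped character: keep it literally
            i += 2
        elif char in "[]":
            i += 1  # unescaped bracket: encapsulation mark, drop it
        else:
            out.append(char)
            i += 1
    return "".join(out)

print(reformat("vaig veure[ <em>]un senyal[</em>]"))
# vaig veure <em>un senyal</em>
```

Handling the escape symbol before the brackets ensures that brackets belonging to the source text survive, while those added by the de-formatter are removed.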
3. Formats for linguistic data
Adequate documentation of the code and auxiliary files is crucial to the success of open-source software. In the case of an MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The formats used by this architecture are based on XML (World Wide Web Consortium, 2004) for interoperability; in particular, for easier parsing, transformation, and maintenance.
The XML formats for each type of linguistic data are defined through conveniently-designed XML document-type definitions (DTDs) which may be found inside the apertium package (available through www.apertium.org). On the one hand, the success of the open-source MT engine heavily depends on the acceptance of these formats by other groups; this is indeed the mechanism by which de facto standards appear. Acceptance may be eased by the use of an interoperable XML-based format which, as mentioned, simplifies the transformation of data from and towards it, and also by the availability of tools to manage linguistic data in these formats; the current project is expected to produce transformation and management tools in a later phase. But, on the other hand, acceptance of the formats also depends on the success of the translation engine itself.
3.1. Dictionaries (lexical processing)
Monolingual morphological dictionaries, bilingual dictionaries and post-generation dictionaries use a common format, defined by DTD dix.dtd in package apertium.
Morphological dictionaries establish the correspondences between surface forms and lexical forms and contain (a) a definition of the alphabet (used by the tokenizer), (b) a section defining the grammatical symbols used in a particular application to specify lexical forms (symbols representing concepts such as noun, verb, plural, present, feminine, etc.), (c) a section defining paradigms (describing reusable groups of correspondences between parts of surface forms and parts of lexical forms), and (d) one or more labelled dictionary sections containing lists of surface form-lexical form correspondences for whole lexical units (including contiguous multi-word units). Paradigms may be used directly in the dictionary sections or to build larger paradigms (at the conceptual level, paradigms represent the regularities in the inflective system of the corresponding language).
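The role of paradigms can be illustrated with a small Python sketch that expands dictionary entries into surface form-lexical form correspondences. Plain Python structures are used instead of the XML format defined by dix.dtd; the paradigm name, the entry layout and the symbol pl are hypothetical and serve only as an example.

```python
# Hypothetical paradigm: (surface suffix, grammatical symbols) pairs.
PARADIGMS = {
    "señal__n": [("", "<n><f><sg>"), ("es", "<n><f><pl>")],
}

# Hypothetical dictionary section: (surface stem, lemma, paradigm name).
ENTRIES = [("señal", "señal", "señal__n")]

def expand(entries, paradigms):
    """Yield every (surface form, lexical form) pair an entry stands for."""
    for stem, lemma, paradigm in entries:
        for suffix, symbols in paradigms[paradigm]:
            yield stem + suffix, lemma + symbols

for surface, lexical in expand(ENTRIES, PARADIGMS):
    print(surface, "->", lexical)
```

A single paradigm can be shared by every noun that inflects in the same way, so each new entry only needs a stem, a lemma and a paradigm name.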
Bilingual dictionaries have a very similar structure and establish correspondences between source-language lexical forms and target-language lexical forms, but seldom use paradigms.
Finally, post-generation dictionaries are used to establish correspondences between input and output strings corresponding to the orthographical transformations to be performed by the post-generator on the target-language surface forms generated by the generator.
3.2. Tagger definition
Source-language lexical forms delivered by the morphological analyser are defined in terms of fine part-of-speech tags (for example, the word cantábamos (es, "we sang") has lemma cantar ("sing"), category verb, and the following inflection information: indicative, imperfect, 1st person, plural), which are necessary in some parts of the MT engine (structural transfer, morphological generation); however, for the purpose of efficient disambiguation, these fine part-of-speech tags may be grouped in coarser part-of-speech tags (such as verb in personal form).
The tagger definition file is also an XML file (the corresponding DTD, tagger.dtd, may also be found in the apertium package) where (a) coarser tags are defined in terms of fine tags, both for single-word and for multi-word units, (b) constraints may be defined to forbid or enforce certain sequences of part-of-speech tags, and (c) priority lists are used to decide which fine part-of-speech tag to pass on to the structural transfer module when the coarse part-of-speech tag contains more than one fine tag. The tagger definition file is used to define the behaviour of the part-of-speech tagger both when it is being trained on a source-language corpus and when it is running as part of the MT system.
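The grouping of fine tags into coarser ones, point (a) above, can be sketched in Python. The coarse tag names and the prefix-based definitions below are hypothetical and do not reproduce the syntax of tagger.dtd; they only show the idea of mapping many fine tags onto a few coarse ones.

```python
# Hypothetical coarse tags, each defined by prefixes of fine-tag sequences.
COARSE_TAGS = [
    ("VERB_PERSONAL", [("vblex", "ifi"), ("vblex", "prs")]),
    ("DET", [("det",)]),
    ("NOUN", [("n",)]),
]

def coarse_tag(fine_tags):
    """Return the first coarse tag whose definition matches the fine tags."""
    for label, prefixes in COARSE_TAGS:
        for prefix in prefixes:
            if tuple(fine_tags[:len(prefix)]) == prefix:
                return label
    return None  # no coarse tag defined for this fine tag

print(coarse_tag(["vblex", "ifi", "1", "sg"]))  # VERB_PERSONAL
print(coarse_tag(["det", "ind", "f", "sg"]))    # DET
```

Disambiguating over a handful of coarse tags rather than over every fine tag keeps the number of HMM parameters manageable.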
3.3. Structural transfer
An XML format for shallow structural transfer rules has also been established; a commented DTD (transfer.dtd) may be found inside the apertium package. Structural transfer rule files contain pattern-action rules which describe what has to be done for each pattern (much like in languages such as perl or lex). Patterns are defined in terms of categories which are in turn defined (in the preamble) in terms of fine morphological tags and, optionally, lemmas for lexicalized rules. For example, a commonly used pattern, determiner-noun, has an associated action which sets the gender and number of the determiner to those of the noun to ensure gender and number agreement.
Using a declarative notation such as XML is rather straightforward for the pattern part of rules, but using it for the action (procedural) part means stretching it a bit; we have, however, found a reasonable way to express linguistic transformations in XML. In this way, we follow as closely as possible the declarative approach used in the XML files defining the linguistic data used for the tagger and for the lexical processing modules.
3.4. De-formatter and re-formatter
De-formatters and re-formatters are generated from format management files specified by the DTD format.dtd in package apertium. These are not linguistic data but are considered in this section for convenience. Format management files for RTF (format-rtf.xml), HTML (format-html.xml) and plain ISO-8859-1 text (format-txt.xml) are provided in package apertium. Scripts apertium-gen-deformat and apertium-gen-reformat in the apertium package generate C++ de-formatters and re-formatters respectively for each format using lex as an intermediate representation.
4. Compilers and preprocessors
The Apertium toolbox contains compilers to convert the linguistic data into the corresponding efficient form used by the modules of the engine. Two main compilers are used in this project: one for the four lexical processing modules of the system and another one for the structural transfer.
4.1. Lexical processing
The lexical processor compiler (lt-comp in package lttoolbox) is very fast (it takes about a minute to compile the current dictionaries in the system) thanks to the use of advanced transducer building strategies and to the minimization of partial finite-state transducers (Roche & Schabes 1997) during construction. This makes linguistic data development much easier, because the effect on the whole system of changing a rule or a lexical item may be tested almost immediately.
The four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator) are implemented as a single program (lt-proc in package lttoolbox) which reads binary files containing a compact and efficient representation of a class of finite-state transducers (in particular, augmented letter transducers, Garrido-Alenda et al. 2002).
4.2. Structural transfer
The current structural transfer preprocessor (file apertium-preprocess-transfer in package apertium) reads in a structural transfer rule file (see section 3.3) and generates a file with precompiled patterns and indexes the actions of the rules of the structural transfer module specification.
As mentioned in section 2.5, structural transfer rules for a given language pair may also be compiled into a specific structural transfer module, if a slight increase in translation speed is desired (this was the default until the release of Apertium 1.0).
5. Concluding remarks
This document describes Apertium: an open-source shallow-transfer machine translation engine for related-language pairs, developed in a large, government-funded open-source development project. It may be adapted to translating between Romance languages of Europe (French, Portuguese, Italian, Occitan, etc.), between European related language pairs outside the Romance group (Danish-Swedish, Czech-Slovak, etc.), or even between other related languages (for instance, Kirwanda-Swahili).
The Apertium shallow-transfer engine has not been designed from scratch but may rather be seen as a complete open-source rewriting of an existing engine (interNOSTRUM, Canals-Marote et al. 2001; Traductor Universia, Garrido-Alenda et al. 2003) which is currently used daily by thousands of people through the net, and the corresponding redesign of linguistic data formats and rewriting of compilers.
The code (in two packages, lttoolbox and apertium), together with pilot Spanish-Catalan (package apertium-es-ca), Spanish-Galician (apertium-es-gl), and Spanish-Portuguese (apertium-es-pt) linguistic data, is available through http://sourceforge.net/projects/apertium/.
Acknowledgements: This work has been funded through project FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism), with support from project TIC2003-08681-C02-01 (Spanish Ministry of Science and Technology). Felipe Sánchez-Martínez is supported by the Spanish Ministry of Science and Education and the European Social Fund through grant BES-2004-4711.
6. References
Figure 1: Block diagram of the Apertium machine translation system
+---------+      +-----------------+
| SL text | ---> |  de-formatter   |
+---------+      +-----------------+
                          |
                          V
                 +-----------------+
                 | morph. analyser |
                 +-----------------+
                          |
                          V
                 +-----------------+
                 | part-of-speech  |
                 |     tagger      |
                 +-----------------+
                          |
                          V
                 +-----------------+     +-----------+
                 |   structural    | <-> |  lexical  |
                 |    transfer     |     | transfer  |
                 +-----------------+     +-----------+
                          |
                          V
                 +-----------------+
                 |  morphological  |
                 |    generator    |
                 +-----------------+
                          |
                          V
                 +-----------------+
                 | post-generator  |
                 +-----------------+
                          |
                          V
                 +-----------------+     +---------+
                 |  re-formatter   | --> | TL text |
                 +-----------------+     +---------+