APERTIUM: AN OPEN-SOURCE SHALLOW-TRANSFER MACHINE TRANSLATION ENGINE FOR RELATED-LANGUAGE PAIRS
Carmen Armentano-Oller, Boyan I. Bonev, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez
Transducens group, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain).
carmentano [at] dlsi.ua.es, bib [at] alu.ua.es, acorbi [at] dlsi.ua.es, mlf [at] ua.es, mginesti [at] dlsi.ua.es, sortiz [at] dlsi.ua.es, japerez [at] dlsi.ua.es, gema [at] internostrum.com, fsanchez [at] dlsi.ua.es
Abstract. We briefly describe Apertium: an open-source shallow-transfer machine translation engine, initially aimed at related-language pairs. Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state based chunking for structural transfer; its architecture is largely based upon that of systems already developed by the Transducens group at the Universitat d'Alacant, such as interNOSTRUM (Spanish-Catalan, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, http://traductor.universia.net). It will be possible to use Apertium to build machine translation systems for a variety of related-language pairs; to that end, the project proposes simple standard formats to encode the linguistic data needed. This paper briefly describes the machine translation engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine.
1. Introduction
[This document is largely based on the paper "An Open-Source Shallow-Transfer Machine Translation Engine for the Romance Languages of Spain", presented by Antonio M. Corbí-Bellot, Mikel L. Forcada, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Iñaki Alegria, Aingeru Mayor, and Kepa Sarasola at the 10th Conference of the European Association for Machine Translation (Budapest, May 30-31, 2005).]
This document describes Apertium: an open-source shallow-transfer machine translation (MT) engine, initially aimed at related-language pairs. The shallow-transfer architecture will be suitable for pairs of closely related languages: Romance language pairs such as Spanish-Catalan or Spanish-Portuguese, or other language pairs such as Czech-Slovak, Danish-Swedish, Kirwanda-Kiswahili, etc.
Existing MT programs are mostly commercial or use proprietary technologies, which makes them very hard to adapt to new usages, and may use different technologies across language pairs, which makes it very difficult to integrate them in a single multilingual content management system.
The MT architecture proposed here uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state based chunking for structural transfer, and is largely based upon that of systems already developed by the Transducens group such as interNOSTRUM (Spanish-Catalan, Canals-Marote et al. 2001, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, Garrido-Alenda et al. 2003, http://traductor.universia.net); these systems are publicly accessible through the net and used on a daily basis by thousands of users.
The MT engine will be released in two packages: lt-toolbox (containing all the lexical processing modules and tools) and apertium itself (containing the rest of the engine) under an open-source license (GPL), together with pilot linguistic data (in separate packages), initially for Romance languages of Spain, Spanish-Catalan (apertium-es-ca-data) and Spanish-Galician (apertium-es-gl-data), and will be distributed free of charge. This means that anyone having the necessary computational and linguistic skills will be able to adapt or enhance it to produce a new MT system, even for other pairs of related languages. The first version of the whole system, together with linguistic data for the Spanish-Catalan pair, is scheduled to be released on July 29, 2005. Spanish-Galician data will be released later in 2005.
We expect that the introduction of a unified open-source MT architecture will ease some of the mentioned problems (having different technologies for different pairs, closed-source architectures being hard to adapt to new uses, etc.). It will also help shift the current business model from a licence-centred one to a services-centred one, and favour the interchange of existing linguistic data through the use of the XML-based formats defined in this project.
The following sections give an overview of the architecture (sec. 2), the formats defined for the encoding of linguistic data (sec. 3), and the compilers used to convert these data into an executable form (sec. 4); finally, we give some concluding remarks (sec. 5).
2. The Apertium MT architecture
The MT strategy used in Apertium has already been described in detail (Canals-Marote et al. 2001; Garrido-Alenda et al. 2003); a sketch will be given here. The engine is a classical shallow-transfer or transformer system consisting of an 8-module assembly line; we have found that this strategy is sufficient to achieve a reasonable translation quality between related languages such as Spanish (es), Catalan (ca) or Galician (gl). While, for these languages, a rudimentary word-for-word MT model may give an adequate translation for about 75% of the text, the addition of homograph disambiguation, management of contiguous multi-word units, and local reordering and agreement rules may raise the fraction of adequately translated text above 90%. This is the approach used in the engine presented here, and we expect it to be useful for other related-language pairs.
To ease diagnosis and independent testing, modules communicate with each other using text streams (examples below give an idea of the communication format used). This allows some of the modules to be used in isolation, independently from the rest of the MT system, for other natural-language processing tasks.
The modules are organized as in the diagram in Figure 1 at the end of this document. Most of the modules are capable of processing tens of thousands of words per second on current desktop workstations; only the structural transfer module lags behind at several thousands of words per second. The following sections describe each module of the shallow-transfer architecture in detail.
2.1. The de-formatter
The de-formatter (generated automatically from a formatting specification file, see 3.4) separates the text to be translated from the format information (RTF, HTML, etc.). Format information is encapsulated so that the rest of the modules treat it as blanks between words. For example, the HTML text in Spanish:
vi <em>una señal</em>
("I saw a signal") would be processed by the de-formatter so that it would encapsulate the HTML tags between brackets and deliver
vi[ <em>]una señal[</em>]
The character sequences in brackets are treated as simple blanks between words by the rest of the modules.
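The encapsulation step can be illustrated with a minimal Python sketch. It is not the flex-generated de-formatter used by the engine: the function name deformat_html and the regular expression are illustrative assumptions, and the protection of special characters already present in the source text is left out.

```python
import re

# A run of HTML tags, together with the whitespace around it, is treated
# as one block of format information (hypothetical simplification).
TAG_RUN = re.compile(r"\s*(?:<[^>]+>\s*)+")

def deformat_html(text):
    """Wrap each run of tags (and adjacent blanks) in square brackets."""
    return TAG_RUN.sub(lambda m: "[" + m.group(0) + "]", text)

print(deformat_html("vi <em>una señal</em>"))
```

Applied to the example above, the sketch reproduces the bracketed form delivered by the de-formatter; blanks that are not adjacent to a tag are left untouched.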
2.2. The morphological analyser
The morphological analyser (program lt-proc in package lt-toolbox with option -a) tokenizes the text in surface forms (lexical units as they appear in texts) and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. Tokenization is not straightforward due to the existence, on the one hand, of contractions, and, on the other hand, of multi-word lexical units. For contractions, the system reads in a single surface form and delivers the corresponding sequence of lexical forms (for instance, the es preposition-article contraction del would be analysed into two lexical forms, one for the preposition de and another one for the article el). Multi-word surface forms are analysed in a left-to-right, longest-match fashion; for instance, the analysis for the es preposition a would not be delivered when the input text is a través de ("through"), which is a multi-word preposition in es. Multi-word surface forms may be invariable (such as a multi-word preposition or conjunction) or inflected (for example, in es, echaban de menos, "they missed", is a form of the imperfect indicative tense of the verb echar de menos, "to miss"). Limited support for some kinds of discontinuous multi-word units is also available. The module reads in a binary file compiled from a source-language morphological dictionary (see section 3.1).
Upon receiving the example text in the previous section, the morphological analyser would deliver
^vi/ver<vblex><ifi><1><sg>$[
<em>]^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$
^señal/señal<n><f><sg>$[</em>]
where each surface form is analysed into one or more lexical forms. For example, vi is analysed into lemma ver, lexical category lexical verb (vblex), indefinite indicative (ifi), 1st person, singular, whereas una (a homograph) receives three analyses: un, determiner, indefinite, feminine singular, and two forms of the present subjunctive (prs) of the verb unir ("to join"). The characters "^" and "$" delimit the analyses for each surface form; lexical forms for each surface form are separated by "/"; angle brackets "<...>" are used to delimit grammatical symbols. The string after the "^" and before the first "/" is the surface form as it appears in the source input text.
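Because the stream is plain text, it can be inspected with a few lines of code. The following Python sketch parses the delimiters just described; the function name parse_analyses is hypothetical, and the sketch assumes that the characters "^", "$" and "/" do not occur unescaped inside surface forms or lemmas.

```python
import re

# ^surface/lexical form/lexical form...$
UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)+)\$")
SYMBOL = re.compile(r"<([^>]+)>")

def parse_analyses(stream):
    """Return a list of (surface form, [(lemma, [grammatical symbols])])."""
    units = []
    for match in UNIT.finditer(stream):
        surface = match.group(1)
        forms = []
        # The first item of the split is the empty string before the first "/".
        for lexical_form in match.group(2).split("/")[1:]:
            lemma = lexical_form.split("<", 1)[0]
            forms.append((lemma, SYMBOL.findall(lexical_form)))
        units.append((surface, forms))
    return units

example = ("^vi/ver<vblex><ifi><1><sg>$[ <em>]"
           "^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ "
           "^señal/señal<n><f><sg>$[</em>]")
for surface, forms in parse_analyses(example):
    print(surface, forms)
```

Format information in square brackets lies outside the "^...$" units and is simply skipped by this sketch.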
2.3. The part-of-speech tagger
As has been shown in the previous example, some surface forms (about 30% in Romance languages) are homographs, ambiguous forms for which the morphological analyser delivers more than one lexical form. The part-of-speech tagger chooses one of them, according to the lexical forms of neighbouring words. When translating between related languages, incorrectly resolved ambiguous surface forms are one of the main sources of errors.
The part-of-speech tagger is generated by the apertium-gen-tagger script from a tagger definition file (see section 3.2). The resulting program has options --tagger for tagging (during machine translation) and --train for training (offline, when building the machine translation system). The result of training is a file containing a hidden Markov model (HMM) which has been estimated from representative source-language texts (using an open-source training program). Two training modes are possible: one can use either a larger amount (millions of words) of untagged text processed by the morphological analyser or a small amount of tagged text (tens of thousands of words) where a lexical form for each homograph has been manually selected. The second method usually leads to a slightly better performance (about 96% correct part-of-speech tags). We are currently building a collection of open corpora (both untagged and tagged) using texts published on the web under Creative Commons licenses.
The result of processing the example text delivered by the morphological analyser with the part-of-speech tagger would be
^ver<vblex><ifi><1><sg>$[
<em>]^un<det><ind><f><sg>$
^señal<n><f><sg>$[</em>]
where the correct lexical form (determiner) has been selected for the word una.
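How neighbouring words can decide among competing analyses may be sketched with a toy Viterbi decoder in Python. This is a deliberately reduced illustration and not the Apertium tagger: it uses tag-to-tag transition probabilities only (emission probabilities are omitted), it assumes a non-empty sentence, and every probability and name below is invented for the example.

```python
import math

FLOOR = 1e-6  # probability assumed for unseen events (illustrative value)

def viterbi(candidates, start, trans):
    """Pick one tag per word, maximizing the product of transition probabilities.

    candidates: list with, for each word, the list of its possible tags
    start:      dict tag -> probability of starting a sentence with that tag
    trans:      dict (previous tag, tag) -> transition probability
    """
    # best[tag] = (log-probability, best path ending in tag)
    best = {t: (math.log(start.get(t, FLOOR)), [t]) for t in candidates[0]}
    for tags in candidates[1:]:
        step = {}
        for cur in tags:
            score, path = max(
                ((s + math.log(trans.get((prev, cur), FLOOR)), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical numbers: a determiner is likelier than a verb before a noun.
START = {"vblex": 0.2}
TRANS = {("vblex", "det"): 0.4, ("vblex", "vblex"): 0.05,
         ("det", "n"): 0.7, ("vblex", "n"): 0.1}
# vi / una / señal, with una ambiguous between determiner and verb
print(viterbi([["vblex"], ["det", "vblex"], ["n"]], START, TRANS))
```

With these invented figures, the determiner reading of una wins because the path verb, determiner, noun has a higher overall probability than verb, verb, noun.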
2.4. The lexical transfer module
The lexical transfer module, which is implemented inside the lt-toolbox library, is called by the structural transfer module (see next section); it reads each source-language lexical form and delivers a corresponding target-language lexical form. The module reads in a binary file compiled from a bilingual dictionary (see section 3.1). The dictionary contains a single equivalent for each source-language entry; that is, no word-sense disambiguation is performed. For some words, multi-word entries are used to safely select the correct equivalent in frequently-occurring fixed contexts. This approach has been used with very good results in Traductor Universia and interNOSTRUM.
Each of the lexical forms in the running example would be translated into Catalan as follows:
ver<vblex> ---> veure<vblex>
un<det> ---> un<det>
señal<n><f> ---> senyal<n><m>
where the remaining grammatical symbols for each lexical form would be simply copied to the target-language output. Note the gender change to masculine when translating señal into Catalan senyal.
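The lookup-and-copy behaviour can be sketched in Python as follows. The real module reads a compiled finite-state transducer; here a plain dictionary stands in for it, and the names BIDIX and transfer_lexical_form, as well as the decision to copy unknown forms unchanged, are assumptions made for illustration.

```python
# Hypothetical stand-in for the compiled bilingual dictionary:
# (lemma, leading symbols) -> (lemma, leading symbols)
BIDIX = {
    ("ver", ("vblex",)): ("veure", ("vblex",)),
    ("un", ("det",)): ("un", ("det",)),
    ("señal", ("n", "f")): ("senyal", ("n", "m")),
}

def transfer_lexical_form(lemma, symbols):
    """Translate the longest matching prefix of symbols; copy the rest."""
    symbols = tuple(symbols)
    for n in range(len(symbols), -1, -1):
        hit = BIDIX.get((lemma, symbols[:n]))
        if hit is not None:
            tl_lemma, tl_symbols = hit
            return tl_lemma, list(tl_symbols) + list(symbols[n:])
    # Assumption: forms missing from the dictionary are passed through as-is.
    return lemma, list(symbols)

print(transfer_lexical_form("señal", ["n", "f", "sg"]))
```

Only the gender symbol of señal is replaced; the number symbol is not part of the dictionary entry and is copied to the output.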
2.5. The structural transfer module
The structural transfer module (generated automatically by the apertium-gen-transfer script from a structural transfer specification file, see 3.3) uses finite-state pattern matching to detect (in the usual left-to-right, longest-match way) fixed-length patterns of lexical forms (chunks or phrases) needing special processing due to grammatical divergences between the two languages (gender and number changes to ensure agreement in the target language, word reorderings, lexical changes such as changes in prepositions, etc.) and performs the corresponding transformations. In the running example, a determiner-noun rule is used to change the gender of the determiner so that it agrees with the noun; the result is
^veure<vblex><ifi><1><sg>$[
<em>]^un<det><ind><m><sg>$
^senyal<n><m><sg>$[</em>]
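A single pattern-action rule of this kind might look as follows in Python. This is only a sketch of the determiner-noun agreement step, not the generated flex-based module; the function name and the representation of lexical forms as (lemma, symbols) pairs are assumptions.

```python
GENDERS = {"m", "f"}

def agree_det_noun(forms):
    """Scan left to right; on a determiner-noun pattern, copy the noun's gender."""
    out = [(lemma, list(symbols)) for lemma, symbols in forms]
    i = 0
    while i < len(out) - 1:
        det_symbols, noun_symbols = out[i][1], out[i + 1][1]
        if det_symbols[:1] == ["det"] and noun_symbols[:1] == ["n"]:
            gender = next((s for s in noun_symbols if s in GENDERS), None)
            if gender is not None:
                out[i] = (out[i][0],
                          [gender if s in GENDERS else s for s in det_symbols])
            i += 2  # the two-word pattern has been consumed
        else:
            i += 1
    return out

sentence = [("veure", ["vblex", "ifi", "1", "sg"]),
            ("un", ["det", "ind", "f", "sg"]),
            ("senyal", ["n", "m", "sg"])]
print(agree_det_noun(sentence))
```

The verb is left alone because it matches no pattern, while the determiner receives the masculine gender of senyal.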
2.6. The morphological generator
The morphological generator (program lt-proc in package lt-toolbox with option -g) delivers a target-language surface form for each target-language lexical form, by suitably inflecting it. The module reads in a binary file compiled from a target-language morphological dictionary (see section 3.1). The result for the running example would be
vaig veure[ <em>]un senyal[</em>]
2.7. The post-generator
The post-generator (program lt-proc in package lt-toolbox with option -p) performs orthographical operations such as contractions and apostrophations. The module reads in a binary file compiled from a rule file expressed as a dictionary (section 3.1). The post-generator is usually dormant (just copies the input to the output) until a special alarm symbol contained in some target-language surface forms wakes it up to perform a particular string transformation if necessary; then it goes back to sleep.
For example, in Catalan, clitic pronouns in contact may change before a verb: em ("to me") and ho ("it") contract into m'ho, em and els ("them") contract into me'ls and em and la ("her") are written me la. To signal these changes, linguists prepend an alarm to the target-language surface form "em" in target-language dictionaries and write post-generation rules to ensure the changes described.
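The dormant/awake behaviour can be sketched in Python. The alarm character "~", the rule table and the function name are all assumptions made for this illustration (the text above does not name the alarm symbol), and the sketch does not check that a verb follows.

```python
ALARM = "~"  # hypothetical alarm symbol
RULES = {"em ho": "m'ho", "em els": "me'ls", "em la": "me la"}

def postgenerate(text):
    """Copy the input until an alarm is seen, then try the rules once."""
    out, i = [], 0
    while i < len(text):
        if text[i] != ALARM:
            out.append(text[i])  # dormant: copy input to output
            i += 1
            continue
        i += 1  # awake: drop the alarm symbol itself
        for source, target in RULES.items():
            end = i + len(source)
            # A rule applies only if it ends at a word boundary.
            if text.startswith(source, i) and not text[end:end + 1].isalpha():
                out.append(target)
                i = end
                break
        # If no rule applied, copying simply resumes (back to sleep).
    return "".join(out)

print(postgenerate("~em ho dius"))
```

When no rule matches after the alarm, as in "~em dius", only the alarm symbol is removed and the surface form em is kept.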
2.8. The re-formatter
Finally, the re-formatter restores the format information encapsulated by the de-formatter into the translated text and removes the encapsulation sequences used to protect certain characters in the source text. The result for the running example would be the correct translation of the HTML text:
vaig veure <em>un senyal</em>
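As a rough Python sketch of this last step, the bracketed format information can be restored by dropping the brackets. The function name is hypothetical, and the sketch assumes that no escaped bracket characters occur in the text, so the removal of protection sequences is not modelled.

```python
import re

# Format information as encapsulated by the de-formatter: [ ... ]
ENCAPSULATED = re.compile(r"\[([^\]]*)\]")

def reformat(text):
    """Replace each bracketed block by its content."""
    return ENCAPSULATED.sub(lambda m: m.group(1), text)

print(reformat("vaig veure[ <em>]un senyal[</em>]"))
```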
3. Formats for linguistic data
Adequate documentation of the code and auxiliary files is crucial to the success of open-source software. In the case of an MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The formats used by this architecture are modified versions of the formats currently used by interNOSTRUM and Traductor Universia. These programs used an ad-hoc text-based format; in the current project, these formats have been converted into XML (World Wide Web Consortium, 2004) for interoperability; in particular, for easier parsing, transformation, and maintenance. The XML formats for each type of linguistic data are defined through XML document-type definitions (DTDs).
3.1. Dictionaries (lexical processing)
The format for monolingual morphological dictionaries and bilingual dictionaries may be seen as an XML version of the format already used in interNOSTRUM or Traductor Universia, which was defined in Garrido et al. 1999. The current DTD (dix.dtd) will be made available as part of the lt-toolbox package, and example Spanish-Catalan morphological dictionaries (apertium-es-ca.es.dix, apertium-es-ca.ca.dix) and a bilingual dictionary (apertium-es-ca.es-ca.dix) will be made available in the apertium-es-ca-data linguistic data package.
Morphological dictionaries establish the correspondences between surface forms and lexical forms and contain (a) a definition of the alphabet (used by the tokenizer), (b) a section defining the grammatical symbols used in a particular application to specify lexical forms (symbols representing concepts such as noun, verb, plural, present, feminine, etc.), (c) a section defining paradigms (describing reusable groups of correspondences between parts of surface forms and parts of lexical forms), and (d) one or more labelled dictionary sections containing lists of surface form-lexical form correspondences for whole lexical units (including contiguous multi-word units). Paradigms may be used directly in the dictionary sections or to build larger paradigms (at the conceptual level, paradigms represent the regularities in the inflective system of the corresponding language). Bilingual dictionaries have a very similar structure and establish correspondences between source-language lexical forms and target-language lexical forms, but seldom use paradigms. Finally, post-generation dictionaries are used to establish correspondences between input and output strings corresponding to the orthographical transformations to be performed by the post-generator on the target-language surface forms generated by the generator.
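The role of paradigms can be illustrated with a small Python sketch that expands dictionary entries into full surface form-lexical form correspondences. It does not use the XML format or dix.dtd; the data structures, the paradigm name and the function name are hypothetical.

```python
# A paradigm pairs surface-form endings with lexical-form endings
# (hypothetical paradigm for nouns inflected like "señal").
PARADIGMS = {
    "señal__n": [("", "<n><f><sg>"), ("es", "<n><f><pl>")],
}

# A dictionary entry: (surface stem, lemma, paradigm name)
ENTRIES = [("señal", "señal", "señal__n")]

def expand(entries, paradigms):
    """Yield every (surface form, lexical form) pair licensed by the entries."""
    for stem, lemma, paradigm in entries:
        for surface_ending, lexical_ending in paradigms[paradigm]:
            yield stem + surface_ending, lemma + lexical_ending

for surface, lexical in expand(ENTRIES, PARADIGMS):
    print(surface, "->", lexical)
```

Adding another noun that inflects in the same way then takes a single new entry that reuses the paradigm.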
3.2. Tagger definition
Source-language lexical forms delivered by the morphological analyser are defined in terms of fine part-of-speech tags (for example, the word cantábamos (es, "we sang") has lemma cantar ("sing"), category verb, and the following inflection information: indicative, imperfect, 1st person, plural), which are necessary in some parts of the MT engine (structural transfer, morphological generation); however, for the purpose of efficient disambiguation, these fine part-of-speech tags may be grouped in coarser part-of-speech tags (such as verb in personal form).
The tagger definition file is also an XML file (the corresponding DTD, tagger.dtd, may also be found in the apertium package) where (a) coarser tags are defined in terms of fine tags, both for single-word and for multi-word units, (b) constraints may be defined to forbid or enforce certain sequences of part-of-speech tags, and (c) priority lists are used to decide which fine part-of-speech tag to pass on to the structural transfer module when the coarse part-of-speech tag contains more than one fine tag. The tagger definition file is used to define the behaviour of the part-of-speech tagger both when it is being trained on a source-language corpus and when it is running as part of the MT system.
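The grouping of fine tags into coarse tags can be sketched in Python as an ordered list of rules. The coarse tag names, the rule format and the function name below are invented for illustration and do not follow tagger.dtd; constraints and priority lists are not modelled.

```python
# Each rule: (coarse tag, set of grammatical symbols that must be present).
# Rules are tried in order; the first one that matches wins.
COARSE_RULES = [
    ("DET", {"det"}),
    ("NOUN", {"n"}),
    ("VERB_PERSONAL", {"vblex"}),
]

def coarse_tag(fine_symbols):
    """Map the grammatical symbols of a lexical form to a coarse tag."""
    present = set(fine_symbols)
    for name, required in COARSE_RULES:
        if required <= present:
            return name
    return "OTHER"

# Two different fine tags fall into the same coarse tag.
print(coarse_tag(["vblex", "ifi", "1", "sg"]))
print(coarse_tag(["vblex", "prs", "3", "sg"]))
```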
3.3. Structural transfer
An XML format for shallow structural transfer rules has also been established; a commented DTD (transfer.dtd) may be found inside the apertium package.
The rule files contain pattern-action rules describing what has to be done for each pattern (much like in languages such as Perl or lex). Using a declarative notation such as XML is rather straightforward for the pattern part of rules but using it for the action (procedural) part means stretching it a bit; we have, however, found a reasonable way to translate the ad-hoc C-style action language used in the corresponding module of interNOSTRUM and Traductor Universia, which was defined in detail in Garrido-Alenda and Forcada (2001), into a simple XML notation having the same expressiveness. In this way, we follow as closely as possible the declarative approach used in the XML files defining the linguistic data used for the tagger and for the lexical processing modules.
3.4. De-formatter and re-formatter
The de-formatters and re-formatters used in Traductor Universia and interNOSTRUM (for plain ISO-8859-1/15 text, HTML and RTF) were written directly in flex (lex) using a pattern-action scheme, with patterns specified as regular expressions and actions written in C code, using lex to generate the executable code. In Apertium, de-formatters and re-formatters written in lex are generated automatically by two scripts, apertium-gen-deformat and apertium-gen-reformat, from XML files describing their behaviour. These files follow the DTD format.dtd, which may be found in the apertium package; the package also provides formatting specification files for plain text (format-txt.xml), HTML (format-html.xml), and RTF (format-rtf.xml).
4. Compilers
Compilers to convert the linguistic data into the corresponding efficient form used by the modules of the engine are currently under development. Two compilers are used in this project: one for the four lexical processing modules of the system and another one for the structural transfer.
4.1. Lexical processing
The four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator) are currently being implemented as a single program (program lt-proc in package lt-toolbox) which reads binary files containing a compact and efficient representation of a class of finite-state transducers (letter transducers, Roche & Schabes 1997); in particular, augmented letter transducers (Garrido-Alenda et al. 2002). These binaries are an improved version of those used in interNOSTRUM and Traductor Universia and are generated from XML dictionaries (specified in section 3.1) using a new compiler (program lt-comp in package lt-toolbox), completely rewritten from scratch. The new compiler is much faster (taking seconds instead of minutes to compile the current dictionaries in interNOSTRUM and Traductor Universia) and uses much less memory, thanks to the use of new transducer building strategies and to the minimization of partial finite-state transducers during construction. This makes linguistic data development much easier, because the effect on the whole system of changing a rule or a lexical item may be tested almost immediately.
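The idea of compiling a dictionary into a letter-by-letter structure and using it for left-to-right, longest-match analysis can be sketched in Python with a simple trie. This is a strong simplification and not the lt-comp algorithm: transitions carry input letters only, lexical forms are stored at final states, no minimization is performed, word boundaries are not checked, and all names are hypothetical.

```python
FINAL = None  # key under which a state stores its lexical forms

def build_trie(pairs):
    """Compile (surface form, lexical form) pairs into a letter trie."""
    root = {}
    for surface, lexical in pairs:
        state = root
        for letter in surface:
            state = state.setdefault(letter, {})
        state.setdefault(FINAL, []).append(lexical)
    return root

def longest_match(trie, text, start=0):
    """Return (end position, lexical forms) of the longest match, or None."""
    state, best = trie, None
    for i in range(start, len(text)):
        state = state.get(text[i])
        if state is None:
            break
        if FINAL in state:
            best = (i + 1, state[FINAL])
    return best

TRIE = build_trie([("a", "a<pr>"), ("a través de", "a través de<pr>")])
print(longest_match(TRIE, "a través de la casa"))
print(longest_match(TRIE, "a casa"))
```

On the first input the multi-word preposition is preferred over the single-word one, which is the longest-match behaviour described for the morphological analyser.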
4.2. Structural transfer
Instead of a proper compiler, a script (apertium-gen-transfer) using an XSLT stylesheet (file transfer.xsl in package apertium) is simply used to transform the structural transfer specification file in XML (see section 3.3) into a flex-based program.
5. Concluding remarks
This document describes Apertium: an open-source shallow-transfer machine translation engine for related-language pairs, developed in a large, government-funded open-source development project. It may be adapted to translating between Romance languages of Europe (French, Portuguese, Italian, Occitan, etc.), between European related language pairs outside the Romance group (Danish-Swedish, Czech-Slovak, etc.), or even between other related languages (for instance, Kirwanda-Swahili).
The Apertium shallow-transfer engine has not been designed from scratch but may rather be seen as a complete open-source rewriting of an existing closed-source engine (interNOSTRUM, Canals-Marote et al. 2001; Traductor Universia, Garrido-Alenda et al. 2003) which is currently used daily by thousands of people through the net, and the corresponding redesign of linguistic data formats and rewriting of compilers.
The code (in two packages, lt-toolbox and apertium), together with pilot Spanish-Catalan linguistic data (package apertium-es-ca-data) to demonstrate it, is scheduled to be released on July 29, 2005, through http://sourceforge.net/projects/apertium/.
Acknowledgements: This work has been funded through project FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism), with support from project TIC2003-08681-C02-01 (Spanish Ministry of Science and Technology). Felipe Sánchez-Martínez is supported by the Spanish Ministry of Science and Education and the European Social Fund through grant BES-2004-4711. We thank Carme Armentano-Oller for testing the architecture.
6. References
Figure 1: Block diagram of the Apertium machine translation system
+---------+      +-----------------+
| SL text | ---> |  de-formatter   |
+---------+      +-----------------+
                          |
                          V
                 +-----------------+
                 | morph. analyser |
                 +-----------------+
                          |
                          V
                 +-----------------+
                 | part-of-speech  |
                 |     tagger      |
                 +-----------------+
                          |
                          V
                 +-----------------+     +-----------+
                 |   structural    | <-> |  lexical  |
                 |    transfer     |     | transfer  |
                 +-----------------+     +-----------+
                          |
                          V
                 +-----------------+
                 |  morphological  |
                 |    generator    |
                 +-----------------+
                          |
                          V
                 +-----------------+
                 | post-generator  |
                 +-----------------+
                          |
                          V
                 +-----------------+     +---------+
                 |  re-formatter   | --> | TL text |
                 +-----------------+     +---------+