APERTIUM: AN OPEN-SOURCE SHALLOW-TRANSFER MACHINE TRANSLATION ENGINE FOR RELATED-LANGUAGE PAIRS

Carmen Armentano-Oller, Boyan I. Bonev, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez

Transducens group, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain).

carmentano [at] dlsi.ua.es, bib [at] alu.ua.es, acorbi [at] dlsi.ua.es, mlf [at] ua.es, mginesti [at] dlsi.ua.es, sortiz [at] dlsi.ua.es, japerez [at] dlsi.ua.es, gema [at] internostrum.com, fsanchez [at] dlsi.ua.es

This documentation is distributed under the GNU General Public License (http://www.gnu.org/licenses/gpl.html).

Abstract. We briefly describe Apertium: an open-source shallow-transfer machine translation engine, initially aimed at related-language pairs. Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state based chunking for structural transfer; its architecture is largely based upon that of systems already developed by the Transducens group at the Universitat d'Alacant, such as interNOSTRUM (Spanish-Catalan, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, http://traductor.universia.net). It will be possible to use Apertium to build machine translation systems for a variety of related-language pairs; to that end, the project proposes simple standard formats to encode the linguistic data needed. This paper briefly describes the machine translation engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine.

1. Introduction

[This document is largely based on the paper "An Open-Source Shallow-Transfer Machine Translation Engine for the Romance Languages of Spain", presented by Antonio M. Corbí-Bellot, Mikel L. Forcada, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Iñaki Alegria, Aingeru Mayor, and Kepa Sarasola at the 10th Conference of the European Association for Machine Translation (Budapest, May 30-31, 2005).]

This document describes Apertium: an open-source shallow-transfer machine translation (MT) engine, initially aimed at related-language pairs. The shallow-transfer architecture is suitable for pairs of closely related languages: Romance language pairs such as Spanish-Catalan or Spanish-Portuguese, and other language pairs such as Czech-Slovak, Danish-Swedish, Kirwanda-Kiswahili, etc.

Existing MT programs are mostly commercial or use proprietary technologies, which makes them very hard to adapt to new uses; they may also use different technologies across language pairs, which makes it very difficult to integrate them in a single multilingual content management system.

The MT architecture proposed here uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state based chunking for structural transfer, and is largely based upon that of systems already developed by the Transducens group such as interNOSTRUM (Spanish-Catalan, Canals-Marote et al. 2001, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, Garrido-Alenda et al. 2003, http://traductor.universia.net); these systems are publicly accessible through the net and used on a daily basis by thousands of users.

The MT engine will be released in two packages: lt-toolbox (containing all the lexical processing modules and tools) and apertium itself (containing the rest of the engine) under an open-source license (GPL), together with pilot linguistic data (in separate packages), initially for Romance languages of Spain, Spanish-Catalan (apertium-es-ca-data) and Spanish-Galician (apertium-es-gl-data), and will be distributed free of charge. This means that anyone having the necessary computational and linguistic skills will be able to adapt or enhance it to produce a new MT system, even for other pairs of related languages. The first version of the whole system, together with linguistic data for the Spanish-Catalan pair, is scheduled to be released on July 29, 2005. Spanish-Galician data will be released later in 2005.

We expect that the introduction of a unified open-source MT architecture will ease some of the mentioned problems (having different technologies for different pairs, closed-source architectures being hard to adapt to new uses, etc.). It will also help shift the current business model from a licence-centred one to a services-centred one, and favour the interchange of existing linguistic data through the use of the XML-based formats defined in this project.

The following sections give an overview of the architecture (sec. 2), the formats defined for the encoding of linguistic data (sec. 3), and the compilers used to convert these data into an executable form (sec. 4); finally, we give some concluding remarks (sec. 5).

2. The Apertium MT architecture

The MT strategy used in Apertium has already been described in detail (Canals-Marote et al. 2001; Garrido-Alenda et al. 2003); a sketch will be given here. The engine is a classical shallow-transfer or transformer system consisting of an 8-module assembly line; we have found that this strategy is sufficient to achieve a reasonable translation quality between related languages such as Spanish (es), Catalan (ca) or Galician (gl). While, for these languages, a rudimentary word-for-word MT model may give an adequate translation for about 75% of the text, the addition of homograph disambiguation, management of contiguous multi-word units, and local reordering and agreement rules may raise the fraction of adequately translated text above 90%. This is the approach used in the engine presented here, and we expect it to be useful for other related-language pairs.

To ease diagnosis and independent testing, modules communicate with one another using text streams (examples below give an idea of the communication format used). This allows some of the modules to be used in isolation, independently from the rest of the MT system, for other natural-language processing tasks.

The modules are organized as in the diagram in Figure 1 at the end of this document. Most of the modules are capable of processing tens of thousands of words per second on current desktop workstations; only the structural transfer module lags behind at several thousand words per second. The following sections describe each module of the shallow-transfer architecture in detail.

2.1. The de-formatter

The de-formatter (generated automatically from a formatting specification file, see 3.4) separates the text to be translated from the format information (RTF, HTML, etc.). Format information is encapsulated so that the rest of the modules treat it as blanks between words. For example, the HTML text in Spanish:

vi <em>una señal</em>

("I saw a signal") would be processed by the de-formatter so that it would encapsulate the HTML tags between brackets and deliver

vi[ <em>]una señal[</em>]

The character sequences in brackets are treated as simple blanks between words by the rest of the modules.
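The encapsulation step can be illustrated with a minimal Python sketch. It is not the actual de-formatter (which is generated from a format specification file); the function name and the regular expression are assumptions made for illustration, and the protection of special characters in the source text is not handled.

```python
import re

# One HTML-like tag together with any blanks directly adjacent to it.
TAG_WITH_BLANKS = re.compile(r"\s*<[^<>]+>\s*")


def deformat_html(text: str) -> str:
    """Wrap each tag (and its adjacent blanks) in square brackets."""
    return TAG_WITH_BLANKS.sub(lambda m: "[" + m.group(0) + "]", text)


print(deformat_html("vi <em>una señal</em>"))  # vi[ <em>]una señal[</em>]
```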

2.2. The morphological analyser

The morphological analyser (program lt-proc in package lt-toolbox with option -a) tokenizes the text into surface forms (lexical units as they appear in texts) and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. Tokenization is not straightforward due to the existence, on the one hand, of contractions, and, on the other hand, of multi-word lexical units. For contractions, the system reads in a single surface form and delivers the corresponding sequence of lexical forms (for instance, the es preposition-article contraction del would be analysed into two lexical forms, one for the preposition de and another one for the article el). Multi-word surface forms are analysed in a left-to-right, longest-match fashion; for instance, the analysis for the es preposition a would not be delivered when the input text is a través de ("through"), which is a multi-word preposition in es. Multi-word surface forms may be invariable (such as a multi-word preposition or conjunction) or inflected (for example, in es, echaban de menos, "they missed", is a form of the imperfect indicative tense of the verb echar de menos, "to miss"). Limited support for some kinds of discontinuous multi-word units is also available. The module reads in a binary file compiled from a source-language morphological dictionary (see section 3.1).
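The left-to-right, longest-match segmentation of multi-word units may be sketched as follows. This is a simplified word-level illustration, not the transducer-based procedure used by lt-proc; the function name, the toy lexicon and the max_words limit are assumptions.

```python
from typing import List, Set


def tokenize_lrlm(text: str, lexicon: Set[str], max_words: int = 4) -> List[str]:
    """Split text into surface forms, preferring the longest known unit."""
    words = text.split()
    units: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first; a single word is always accepted
        # (possibly as an unknown word), so the loop always advances.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in lexicon:
                units.append(candidate)
                i += n
                break
    return units


toy_lexicon = {"a", "a través de", "la", "ventana"}
print(tokenize_lrlm("a través de la ventana", toy_lexicon))
```

With this toy lexicon, the preposition a is not delivered on its own when it starts the multi-word unit a través de, but it is still recognized in a text such as "voy a casa".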

Upon receiving the example text in the previous section, the morphological analyser would deliver

^vi/ver<vblex><ifi><1><sg>$[ <em>]^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ ^señal/señal<n><f><sg>$[</em>]

where each surface form is analysed into one or more lexical forms. For example, vi is analysed into lemma ver, lexical category lexical verb (vblex), indefinite indicative (ifi), 1st person, singular, whereas una (a homograph) receives three analyses: un, determiner, indefinite, feminine singular, and two forms of the present subjunctive (prs) of the verb unir ("to join"). The characters "^" and "$" delimit the analyses for each surface form; lexical forms for each surface form are separated by "/"; angle brackets "<...>" are used to delimit grammatical symbols. The string after the "^" and before the first "/" is the surface form as it appears in the source input text.
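Because modules exchange plain text, this stream is easy to inspect with a few lines of code. The following Python sketch parses the format just described; the class and function names are hypothetical, and escaped delimiter characters inside the text are not handled.

```python
import re
from typing import List, NamedTuple


class LexicalForm(NamedTuple):
    lemma: str
    tags: List[str]


class Unit(NamedTuple):
    surface: str
    analyses: List[LexicalForm]


UNIT = re.compile(r"\^([^$]*)\$")   # text between "^" and "$"
TAG = re.compile(r"<([^<>]+)>")     # one grammatical symbol


def parse_lexical_form(text: str) -> LexicalForm:
    """Split 'lemma<tag1><tag2>' into the lemma and its list of symbols."""
    lemma = text.split("<", 1)[0]
    return LexicalForm(lemma, TAG.findall(text))


def parse_analyser_output(stream: str) -> List[Unit]:
    """Return one Unit per '^...$' group; bracketed format blocks are skipped."""
    units = []
    for match in UNIT.finditer(stream):
        surface, *forms = match.group(1).split("/")
        units.append(Unit(surface, [parse_lexical_form(f) for f in forms]))
    return units


stream = ("^vi/ver<vblex><ifi><1><sg>$[ <em>]"
          "^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ "
          "^señal/señal<n><f><sg>$[</em>]")
for unit in parse_analyser_output(stream):
    print(unit.surface, [lf.lemma for lf in unit.analyses])
```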

2.3. The part-of-speech tagger

As has been shown in the previous example, some surface forms (about 30% in Romance languages) are homographs, ambiguous forms for which the morphological analyser delivers more than one lexical form. The part-of-speech tagger chooses one of them, according to the lexical forms of neighbouring words. When translating between related languages, ambiguous surface forms are one of the main sources of errors when they are resolved incorrectly.

The part-of-speech tagger is generated by the apertium-gen-tagger script from a tagger definition file (see section 3.2). The resulting program has options --tagger for tagging (during machine translation) and --train for training (offline, when building the machine translation system). The result of training is a file containing a hidden Markov model (HMM) which has been obtained on representative source-language texts (using an open-source training program). Two training modes are possible: one can use either a large amount (millions of words) of untagged text processed by the morphological analyser or a small amount of tagged text (tens of thousands of words) where a lexical form for each homograph has been manually selected. The second method usually leads to a slightly better performance (about 96% correct part-of-speech tags). We are currently building a collection of open corpora (both untagged and tagged) using texts published on the web under Creative Commons licenses.

The result of processing the example text delivered by the morphological analyser with the part-of-speech tagger would be

^ver<vblex><ifi><1><sg>$[ <em>]^un<det><ind><f><sg>$ ^señal<n><f><sg>$[</em>]

where the correct lexical form (determiner) has been selected for the word una.
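How neighbouring words drive the choice can be illustrated with a small Viterbi-style decoder. This is a deliberately reduced sketch, not the Apertium tagger: it uses only tag-to-tag transition probabilities (emission probabilities are left out), and every number below is invented for the example.

```python
from typing import Dict, List, Tuple


def viterbi(candidates: List[List[str]],
            trans: Dict[Tuple[str, str], float],
            start: Dict[str, float],
            floor: float = 1e-6) -> List[str]:
    """Pick one tag per position, maximising the product of transitions."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start.get(t, floor), [t]) for t in candidates[0]}
    for options in candidates[1:]:
        new_best = {}
        for t in options:
            prob, path = max(
                ((p * trans.get((prev, t), floor), pth)
                 for prev, (p, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[t] = (prob, path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Candidate categories for "vi una señal"; "una" is the homograph.
candidates = [["vblex"], ["det", "vblex"], ["n"]]
# Invented transition and start probabilities, for illustration only.
trans = {("vblex", "det"): 0.5, ("vblex", "vblex"): 0.1,
         ("det", "n"): 0.8, ("vblex", "n"): 0.2}
start = {"vblex": 0.3, "det": 0.4, "n": 0.3}
print(viterbi(candidates, trans, start))  # ['vblex', 'det', 'n']
```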

2.4. The lexical transfer module

The lexical transfer module, which is implemented inside the lt-toolbox library, is called by the structural transfer module (see next section); it reads each source-language lexical form and delivers a corresponding target-language lexical form. The module reads in a binary file compiled from a bilingual dictionary (see section 3.1). The dictionary contains a single equivalent for each source-language entry; that is, no word-sense disambiguation is performed. For some words, multi-word entries are used to safely select the correct equivalent in frequently-occurring fixed contexts. This approach has been used with very good results in Traductor Universia and interNOSTRUM.

Each of the lexical forms in the running example would be translated into Catalan as follows:

ver<vblex> ---> veure<vblex>

un<det> ---> un<det>

señal<n><f> ---> senyal<n><m>

where the remaining grammatical symbols for each lexical form would be simply copied to the target-language output. Note the gender change to masculine when translating señal into Catalan senyal.
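A minimal sketch of this lookup-and-copy behaviour is given below, assuming the bilingual dictionary is held as an ordinary Python mapping (the real module reads a compiled transducer). The names BIDIX and transfer are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, Tuple[str, ...]]

# Toy es-ca bilingual dictionary with the three entries of the running example.
BIDIX: Dict[Entry, Entry] = {
    ("ver", ("vblex",)): ("veure", ("vblex",)),
    ("un", ("det",)): ("un", ("det",)),
    ("señal", ("n", "f")): ("senyal", ("n", "m")),
}


def transfer(lemma: str, tags: List[str]) -> Tuple[str, List[str]]:
    """Translate the lemma and the leading symbols; copy the remaining ones."""
    # Try the longest prefix of symbols first.
    for n in range(len(tags), -1, -1):
        key = (lemma, tuple(tags[:n]))
        if key in BIDIX:
            tl_lemma, tl_tags = BIDIX[key]
            return tl_lemma, list(tl_tags) + tags[n:]
    raise KeyError(f"no bilingual entry for {lemma}")


print(transfer("señal", ["n", "f", "sg"]))  # ('senyal', ['n', 'm', 'sg'])
```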

2.5. The structural transfer module

The structural transfer module (generated automatically by the apertium-gen-transfer script from a structural transfer specification file, see 3.3) uses finite-state pattern matching to detect (in the usual left-to-right, longest-match way) fixed-length patterns of lexical forms (chunks or phrases) needing special processing due to grammatical divergences between the two languages (gender and number changes to ensure agreement in the target language, word reorderings, lexical changes such as changes in prepositions, etc.) and performs the corresponding transformations. This module is compiled from a transfer rule file (see section 3.3). In the running example, a determiner-noun rule is used to change the gender of the determiner so that it agrees with the noun; the result is

^veure<vblex><ifi><1><sg>$[ <em>]^un<det><ind><m><sg>$ ^senyal<n><m><sg>$[</em>]
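The determiner-noun rule of the running example can be mimicked with a hand-written Python function. It is only an illustration of the effect of one rule, under the assumption that lexical forms are held as (lemma, symbols) pairs; the real module is generated from a rule file and uses finite-state pattern matching.

```python
from typing import List, Optional, Tuple

LexForm = Tuple[str, List[str]]
GENDERS = {"m", "f"}


def gender_of(tags: List[str]) -> Optional[str]:
    """Return the gender symbol of a lexical form, if it has one."""
    return next((t for t in tags if t in GENDERS), None)


def det_noun_agreement(forms: List[LexForm]) -> List[LexForm]:
    """For each determiner followed by a noun, copy the noun's gender."""
    out = [(lemma, list(tags)) for lemma, tags in forms]
    for i in range(len(out) - 1):
        det_tags, noun_tags = out[i][1], out[i + 1][1]
        if det_tags[:1] == ["det"] and noun_tags[:1] == ["n"]:
            noun_gender = gender_of(noun_tags)
            if noun_gender is not None:
                out[i] = (out[i][0],
                          [noun_gender if t in GENDERS else t for t in det_tags])
    return out


sentence = [("veure", ["vblex", "ifi", "1", "sg"]),
            ("un", ["det", "ind", "f", "sg"]),
            ("senyal", ["n", "m", "sg"])]
print(det_noun_agreement(sentence)[1])  # ('un', ['det', 'ind', 'm', 'sg'])
```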

2.6. The morphological generator

The morphological generator (program lt-proc in package lt-toolbox with option -g) delivers a target-language surface form for each target-language lexical form, by suitably inflecting it. The module reads in a binary file compiled from a target-language morphological dictionary (see section 3.1). The result for the running example would be

vaig veure[ <em>]un senyal[</em>]

2.7. The post-generator

The post-generator (program lt-proc in package lt-toolbox with option -p) performs orthographical operations such as contractions and apostrophations. The module reads in a binary file compiled from a rule file expressed as a dictionary (section 3.1). The post-generator is usually dormant (just copies the input to the output) until a special alarm symbol contained in some target-language surface forms wakes it up to perform a particular string transformation if necessary; then it goes back to sleep.

For example, in Catalan, clitic pronouns in contact may change before a verb: em ("to me") and ho ("it") contract into m'ho; em and els ("them") contract into me'ls; and em and la ("her") are written me la. To signal these changes, linguists prepend an alarm to the target-language surface form "em" in target-language dictionaries and write post-generation rules to ensure the changes described.
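The dormant/awake behaviour can be sketched in Python as follows. The alarm character "~", the rule table and the function name are assumptions made for this illustration; the actual rules are compiled from a post-generation dictionary.

```python
import re

ALARM = "~"  # hypothetical alarm symbol, chosen for this sketch
RULES = [("em ho", "m'ho"), ("em els", "me'ls"), ("em la", "me la")]


def post_generate(text: str) -> str:
    """Copy the input unchanged until an alarm is found, then try the rules."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != ALARM:
            out.append(text[i])  # dormant: plain copy
            i += 1
            continue
        rest = text[i + 1:]
        for source, target in RULES:
            # The rule must match whole words right after the alarm.
            if re.match(re.escape(source) + r"\b", rest):
                out.append(target)
                i += 1 + len(source)
                break
        else:
            i += 1  # no rule applies: just drop the alarm
    return "".join(out)


for example in ("~em ho dóna", "~em els dóna", "~em la dóna", "~em dóna"):
    print(post_generate(example))
```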

2.8. The re-formatter

Finally, the re-formatter restores the format information encapsulated by the de-formatter into the translated text and removes the encapsulation sequences used to protect certain characters in the source text. The result for the running example would be the correct translation of the HTML text:

vaig veure <em>un senyal</em>
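Restoring the format amounts to removing the brackets added by the de-formatter. A minimal Python sketch, assuming that bracketed blocks are never nested and ignoring the protected-character sequences mentioned above (the function name is hypothetical):

```python
import re

BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def reformat(text: str) -> str:
    """Replace each bracketed block by the format information it contains."""
    return BRACKETED.sub(lambda m: m.group(1), text)


print(reformat("vaig veure[ <em>]un senyal[</em>]"))  # vaig veure <em>un senyal</em>
```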

3. Formats for linguistic data

Adequate documentation of the code and auxiliary files is crucial to the success of open-source software. In the case of an MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The formats used by this architecture are modified versions of the formats currently used by interNOSTRUM and Traductor Universia. These programs used an ad-hoc text-based format; in the current project, these formats have been converted into XML (World Wide Web Consortium, 2004) for interoperability; in particular, for easier parsing, transformation, and maintenance. The XML formats for each type of linguistic data are defined through XML document-type definitions (DTDs).

3.1. Dictionaries (lexical processing)

The format for monolingual morphological dictionaries and bilingual dictionaries may be seen as an XML version of the format already used in interNOSTRUM or Traductor Universia, which was defined in Garrido et al. 1999. The current DTD (dix.dtd) will be made available as part of the lt-toolbox package, and examples of Spanish-Catalan morphological dictionaries (apertium-es-ca.es.dix, apertium-es-ca.ca.dix) and of a bilingual dictionary (apertium-es-ca.es-ca.dix) will be made available in the apertium-es-ca-data linguistic data package.

Morphological dictionaries establish the correspondences between surface forms and lexical forms and contain (a) a definition of the alphabet (used by the tokenizer), (b) a section defining the grammatical symbols used in a particular application to specify lexical forms (symbols representing concepts such as noun, verb, plural, present, feminine, etc.), (c) a section defining paradigms (describing reusable groups of correspondences between parts of surface forms and parts of lexical forms), and (d) one or more labelled dictionary sections containing lists of surface form-lexical form correspondences for whole lexical units (including contiguous multi-word units). Paradigms may be used directly in the dictionary sections or to build larger paradigms (at the conceptual level, paradigms represent the regularities in the inflective system of the corresponding language). Bilingual dictionaries have a very similar structure and establish correspondences between source-language lexical forms and target-language lexical forms, but seldom use paradigms. Finally, post-generation dictionaries are used to establish correspondences between input and output strings corresponding to the orthographical transformations to be performed by the post-generator on the target-language surface forms generated by the generator.
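The role of paradigms can be illustrated with a small Python sketch that expands dictionary entries into surface form-lexical form correspondences. The data structures and the paradigm name are hypothetical and do not reproduce the XML format defined by dix.dtd.

```python
from typing import Dict, List, Tuple

# A paradigm maps a surface-form ending to the symbols of the lexical form.
PARADIGMS: Dict[str, Dict[str, str]] = {
    "señal__n": {"": "<n><f><sg>", "es": "<n><f><pl>"},
}

# Each entry: (surface stem, lemma, paradigm name). Both nouns share a paradigm.
ENTRIES: List[Tuple[str, str, str]] = [
    ("señal", "señal", "señal__n"),
    ("piel", "piel", "señal__n"),
]


def expand(entries: List[Tuple[str, str, str]],
           paradigms: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Build the table surface form -> list of lexical forms."""
    table: Dict[str, List[str]] = {}
    for stem, lemma, paradigm in entries:
        for ending, symbols in paradigms[paradigm].items():
            table.setdefault(stem + ending, []).append(lemma + symbols)
    return table


for surface, lexical in expand(ENTRIES, PARADIGMS).items():
    print(surface, lexical)
```

Adding a new regular noun then requires only one new entry that points to an existing paradigm.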

3.2. Tagger definition

Source-language lexical forms delivered by the morphological analyser are defined in terms of fine part-of-speech tags (for example, the word cantábamos (es, "we sang") has lemma cantar ("sing"), category verb, and the following inflection information: indicative, imperfect, 1st person, plural), which are necessary in some parts of the MT engine (structural transfer, morphological generation); however, for the purpose of efficient disambiguation, these fine part-of-speech tags may be grouped in coarser part-of-speech tags (such as verb in personal form).

The tagger definition file is also an XML file (the corresponding DTD, tagger.dtd, may also be found in the apertium package) where (a) coarser tags are defined in terms of fine tags, both for single-word and for multi-word units, (b) constraints may be defined to forbid or enforce certain sequences of part-of-speech tags, and (c) priority lists are used to decide which fine part-of-speech tag to pass on to the structural transfer module when the coarse part-of-speech tag contains more than one fine tag. The tagger definition file is used to define the behaviour of the part-of-speech tagger both when it is being trained on a source-language corpus and when it is running as part of the MT system.
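Points (a) and (c) may be pictured with the following Python sketch. The coarse tag names, the prefix-matching scheme and the priority order are all invented for illustration and are not taken from tagger.dtd or from any actual tagger definition file.

```python
from typing import List, Tuple

# (a) Coarse tags defined by a prefix of fine symbols (hypothetical names).
COARSE_TAGS: List[Tuple[str, List[str]]] = [
    ("VERB_FINITE", ["vblex", "ifi"]),
    ("VERB_FINITE", ["vblex", "prs"]),
    ("DET", ["det"]),
    ("NOUN", ["n"]),
]

# (c) Priority list: preferred fine tags, most preferred first (invented).
PRIORITY: List[List[str]] = [
    ["vblex", "prs", "3", "sg"],
    ["vblex", "prs", "1", "sg"],
]


def coarse_tag(fine_tags: List[str]) -> str:
    """Map a fine tag (list of symbols) to its coarse tag."""
    for name, prefix in COARSE_TAGS:
        if fine_tags[:len(prefix)] == prefix:
            return name
    return "UNKNOWN"


def pick_fine(candidates: List[List[str]]) -> List[str]:
    """Choose among fine tags sharing a coarse tag, using the priority list."""
    for preferred in PRIORITY:
        if preferred in candidates:
            return preferred
    return candidates[0]


print(coarse_tag(["vblex", "ifi", "1", "sg"]))
print(pick_fine([["vblex", "prs", "1", "sg"], ["vblex", "prs", "3", "sg"]]))
```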

3.3. Structural transfer

An XML format for shallow structural transfer rules has also been established; a commented DTD (transfer.dtd) may be found inside the apertium package.

The rule files contain pattern-action rules describing what has to be done for each pattern (much like in languages such as Perl or lex). Using a declarative notation such as XML is rather straightforward for the pattern part of rules, but using it for the action (procedural) part means stretching it a bit; we have, however, found a reasonable way to translate the ad-hoc C-style action language used in the corresponding module of interNOSTRUM and Traductor Universia, which was defined in detail in Garrido-Alenda and Forcada (2001), into a simple XML notation having the same expressiveness. In this way, we follow as closely as possible the declarative approach used in the XML files defining the linguistic data used for the tagger and for the lexical processing modules.

3.4. De-formatter and re-formatter

The de-formatters and re-formatters used in Traductor Universia and interNOSTRUM (for plain ISO-8859-1/15 text, HTML and RTF) were written directly in flex (lex) using a pattern-action scheme, with patterns specified as regular expressions and actions written in C code, using lex to generate the executable code. In Apertium, de-formatters and re-formatters written in lex are generated automatically by two scripts, apertium-gen-deformat and apertium-gen-reformat, from XML files describing their behaviour. These files follow the DTD format.dtd, which may be found in the apertium package; the package also provides formatting specification files for plain text (format-txt.xml), HTML (format-html.xml), and RTF (format-rtf.xml).

4. Compilers

Compilers to convert the linguistic data into the corresponding efficient form used by the modules of the engine are currently under development. Two compilers are used in this project: one for the four lexical processing modules of the system and another one for the structural transfer.

4.1. Lexical processing

The four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator) are currently being implemented as a single program (program lt-proc in package lt-toolbox) which reads binary files containing a compact and efficient representation of a class of finite-state transducers (letter transducers, Roche & Schabes 1997); in particular, augmented letter transducers (Garrido-Alenda et al. 2002). These binaries are an improved version of those used in interNOSTRUM and Traductor Universia and are generated from XML dictionaries (specified in section 3.1) using a new compiler (program lt-comp in package lt-toolbox), completely rewritten from scratch. The new compiler is much faster (taking seconds instead of minutes to compile the current dictionaries in interNOSTRUM and Traductor Universia) and uses much less memory, thanks to the use of new transducer building strategies and to the minimization of partial finite-state transducers during construction. This makes linguistic data development much easier, because the effect on the whole system of changing a rule or a lexical item may be tested almost immediately.

4.2. Structural transfer

Instead of a proper compiler, a script (apertium-gen-transfer) using an XSLT stylesheet (file transfer.xsl in package apertium) is simply used to transform the structural transfer specification file in XML (see section 3.3) into a flex-based program.

5. Concluding remarks

This document describes Apertium: an open-source shallow-transfer machine translation engine for related-language pairs, developed in a large, government-funded open-source development project. It may be adapted to translating between Romance languages of Europe (French, Portuguese, Italian, Occitan, etc.), between European related language pairs outside the Romance group (Danish-Swedish, Czech-Slovak, etc.), or even between other related languages (for instance, Kirwanda-Swahili).

The Apertium shallow-transfer engine has not been designed from scratch but may rather be seen as a complete open-source rewriting of an existing closed-source engine (interNOSTRUM, Canals-Marote et al. 2001; Traductor Universia, Garrido-Alenda et al. 2003) which is currently used daily by thousands of people through the net, and the corresponding redesign of linguistic data formats and rewriting of compilers.

The code (in two packages, lt-toolbox and apertium), together with pilot Spanish-Catalan linguistic data (package apertium-es-ca-data) to demonstrate it, is scheduled to be released on July 29, 2005, through http://sourceforge.net/projects/apertium/.

Acknowledgements: This work has been funded through project FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism), with support from project TIC2003-08681-C02-01 (Spanish Ministry of Science and Technology). Felipe Sánchez-Martínez is supported by the Spanish Ministry of Science and Education and the European Social Fund through grant BES-2004-4711. We thank Carme Armentano-Oller for testing the architecture.

6. References

Figure 1: Block diagram of the Apertium machine translation system

+---------+       +-----------------+
| SL text |  ---> | de-formatter    |
+---------+       +-----------------+
                           |
                           V
                  +-----------------+
                  | morph. analyser |
                  +-----------------+
                           |
                           V
                  +-----------------+
                  | part-of-speech  |
                  |    tagger       |
                  +-----------------+
                           |
                           V
                  +-----------------+       +-----------+
                  |   structural    |  <->  | lexical   |
                  |    transfer     |       | transfer  |
                  +-----------------+       +-----------+
                           |
                           V
                  +-----------------+
                  |  morphological  |
                  |    generator    |
                  +-----------------+
                           |
                           V
                  +-----------------+
                  | post-generator  |
                  +-----------------+
                           |
                           V
                  +-----------------+       +---------+
                  |  re-formatter   |  ---> | TL text |
                  +-----------------+       +---------+