Apertium: an open-source shallow-transfer machine translation engine for related-language pairs

Carme Armentano-Oller, Rafael C. Carrasco, Boyan I. Bonev, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez

Transducens group, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain).

carmentano [at] dlsi.ua.es, carrasco [at] dlsi.ua.es, bib [at] alu.ua.es, acorbi [at] dlsi.ua.es, mlf [at] ua.es, mginesti [at] dlsi.ua.es, sortiz [at] dlsi.ua.es, japerez [at] dlsi.ua.es, gema [at] internostrum.com, fsanchez [at] dlsi.ua.es

January 24, 2006

This documentation is distributed under the GNU General Public License (http://www.gnu.org/licenses/gpl.html)

Abstract. We briefly describe Apertium: an open-source shallow-transfer machine translation engine, initially aimed at related-language pairs. Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state-based chunking for structural transfer; its architecture is largely based upon that of systems already developed by the Transducens group at the Universitat d'Alacant, such as interNOSTRUM (Spanish-Catalan, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, http://traductor.universia.net). It will be possible to use Apertium to build machine translation systems for a variety of related-language pairs; to that end, the project proposes simple standard formats to encode the linguistic data needed. This paper briefly describes the machine translation engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine.

1. Introduction

[This document is largely based on the paper "An Open-Source Shallow-Transfer Machine Translation Engine for the Romance Languages of Spain", presented by Antonio M. Corbí-Bellot, Mikel L. Forcada, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Iñaki Alegria, Aingeru Mayor, and Kepa Sarasola at the 10th Conference of the European Association for Machine Translation (Budapest, May 30-31, 2005).]

This document describes Apertium: an open-source shallow-transfer machine translation (MT) engine, initially aimed at related-language pairs. The shallow-transfer architecture should be suitable for pairs of closely related languages: Romance language pairs such as Spanish-Catalan or Spanish-Portuguese, and other language pairs such as Czech-Slovak, Danish-Swedish, Kirwanda-Kiswahili, etc.

Existing MT programs are mostly commercial or use proprietary technologies, which makes them very hard to adapt to new uses; moreover, they may use different technologies across language pairs, which makes it very difficult to integrate them into a single multilingual content management system.

The MT architecture proposed here uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state-based chunking for structural transfer. It is largely based upon that of systems already developed by the Transducens group, such as interNOSTRUM (Spanish-Catalan, Canals-Marote et al. 2001, http://www.internostrum.com/welcome.php) and Traductor Universia (Spanish-Portuguese, Garrido-Alenda et al. 2003, http://traductor.universia.net); these systems are publicly accessible through the net and used on a daily basis by thousands of users.

The MT engine and toolbox have been released in two packages under an open-source license (GPL): lttoolbox (containing all the lexical processing modules and tools) and apertium itself (containing the rest of the engine). In addition to the toolbox, open-source data are available for three language pairs: Spanish-Catalan (apertium-es-ca) and Spanish-Galician (apertium-es-gl), developed under the OpenTrad consortium, and Spanish-Portuguese (apertium-es-pt), developed at the University of Alacant. This means that anyone having the necessary computational and linguistic skills will be able to adapt or enhance the system to produce a new MT system, even for other pairs of related languages. The first version of the whole system, together with linguistic data for the Spanish-Catalan language pair, was released on July 29, 2005. The description in this document applies to version 1.0 of apertium and version 1.0 of lttoolbox.

Web prototypes for all three pairs may also be tested on plain texts, RTF and HTML documents and websites at the address www.apertium.org.

We expect that the introduction of a unified open-source MT architecture will ease some of the problems mentioned above (having different technologies for different pairs, closed-source architectures being hard to adapt to new uses, etc.). It will also help shift the current business model from a licence-centred one to a services-centred one, and favour the interchange of existing linguistic data through the use of the XML-based formats defined in this project.

The following sections give an overview of the architecture (sec. 2), the formats defined for the encoding of linguistic data (sec. 3), and the compilers used to convert these data into an executable form (sec. 4); finally, we give some concluding remarks (sec. 5).

2. The Apertium MT architecture

The MT strategy used in Apertium has already been described in detail (Canals-Marote et al. 2001; Garrido-Alenda et al. 2003); a sketch will be given here. The engine is a classical shallow-transfer or transformer system consisting of an 8-module assembly line; we have found that this strategy is sufficient to achieve a reasonable translation quality between related languages such as Spanish (es), Catalan (ca), Galician (gl) or Portuguese (pt). While, for these languages, a rudimentary word-for-word MT model may give an adequate translation for about 75% of the text (measured as the percentage of words in a text that do not need correction), the addition of homograph disambiguation, management of contiguous multi-word units, and local reordering and agreement rules may raise the fraction of adequately translated text above 90%. This is the approach used in the engine presented here, and we expect it to be useful for other related-language pairs.

To ease diagnosis and independent testing, modules communicate with each other through text streams (the examples below give an idea of the communication format used). This allows some of the modules to be used in isolation, independently of the rest of the MT system, for other natural-language processing tasks. The apertium package includes a shell script, apertium-translator, which calls all modules as necessary for a given language pair and a given text format.

The modules are organized as in the diagram in Figure 1 at the end of this document. Most of the modules are capable of processing tens of thousands of words per second on current desktop workstations; only the structural transfer module lags behind at several thousand words per second. The following sections describe each module of the shallow-transfer architecture in detail.

2.1. The de-formatter

The de-formatter (generated automatically from a formatting specification file, see 3.4) separates the text to be translated from the format information (RTF, HTML, etc.). Format information is encapsulated so that the rest of the modules treat it as blanks between words. For example, the HTML text in Spanish:

vi <em>una señal</em>

("I saw a signal") would be processed by the de-formatter so that it would encapsulate the HTML tags between brackets and deliver

vi[ <em>]una señal[</em>]

The character sequences in brackets are treated as simple blanks between words by the rest of the modules. As usual, the escape symbol \ is used before symbols [ and ] if present in the text.
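
The encapsulation step can be pictured with a minimal Python sketch. It is not the lex-generated de-formatter shipped with Apertium: the function name deformat_html and the regular expressions are illustrative assumptions, and only HTML-like tags and the two bracket symbols mentioned above are handled.

```python
import re

def deformat_html(text: str) -> str:
    """Toy de-formatter (hypothetical): hide HTML-like tags inside [...] blocks."""
    # Escape literal brackets first so they cannot be mistaken for the
    # encapsulation brackets added in the next step.
    escaped = re.sub(r"([\[\]])", r"\\\1", text)
    # Wrap each run of tags, together with the whitespace around it,
    # in brackets so that later modules can treat it as a blank.
    return re.sub(r"\s*(?:<[^>]+>\s*)+",
                  lambda m: "[" + m.group(0) + "]", escaped)

print(deformat_html("vi <em>una señal</em>"))  # vi[ <em>]una señal[</em>]
```

Escaping is done before encapsulation so that brackets already present in the text stay distinguishable from the ones the de-formatter introduces.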

2.2. The morphological analyser

The morphological analyser (program lt-proc in package lttoolbox with option -a) tokenizes the text into surface forms (lexical units as they appear in texts) and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. Tokenization is not straightforward due to the existence, on the one hand, of contractions, and, on the other hand, of multi-word lexical units. For contractions, the system reads in a single surface form and delivers the corresponding sequence of lexical forms (for instance, the es preposition-article contraction del would be analysed into two lexical forms, one for the preposition de and another one for the article el). Multi-word surface forms are analysed in a left-to-right, longest-match fashion; for instance, the analysis for the es preposition a would not be delivered when the input text is a través de ("through"), which is a multi-word preposition in es. Multi-word surface forms may be invariable (such as a multi-word preposition or conjunction) or inflected (for example, in es, echaban de menos, "they missed", is a form of the imperfect indicative tense of the verb echar de menos, "to miss"). Apertium offers support for many types of inflected multi-word units. The module reads in a binary file compiled from a source-language morphological dictionary (see section 3.1).
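
The left-to-right, longest-match policy can be illustrated with a short Python sketch. It works on whitespace-separated words and a plain set of known units, whereas the real analyser runs a compiled finite-state transducer over characters; the function name tokenize, the max_len parameter and the toy lexicon are assumptions made for illustration.

```python
def tokenize(words, lexicon, max_len=4):
    """Segment a word list into the longest units found in `lexicon`."""
    units, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lexicon:
                units.append(candidate)
                i += n
                break
        else:
            # No entry matched: pass the single word through unchanged.
            units.append(words[i])
            i += 1
    return units

# Hypothetical toy lexicon containing both "a" and the multi-word "a través de".
lexicon = {"a", "a través de", "de", "la", "ventana"}
print(tokenize("a través de la ventana".split(), lexicon))
# ['a través de', 'la', 'ventana']
```

Because the longest candidate is tried first, the single-word entry for a is never delivered when the multi-word preposition matches.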

Upon receiving the example text in the previous section, the morphological analyser would deliver

^vi/ver<vblex><ifi><1><sg>$[ <em>]^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ ^señal/señal<n><f><sg>$[</em>]

where each surface form is analysed into one or more lexical forms. For example, vi is analysed into lemma ver, lexical category lexical verb (vblex), indefinite indicative (ifi), 1st person, singular, whereas una (a homograph) receives three analyses: un, determiner, indefinite, feminine singular, and two forms (1st and 3rd person singular) of the present subjunctive (prs) of the verb unir ("to join"). The characters "^" and "$" delimit the analyses for each surface form; lexical forms for each surface form are separated by "/"; angle brackets "<...>" are used to delimit grammatical symbols. The string after the "^" and before the first "/" is the surface form as it appears in the source input text.
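
As an illustration of how another program might consume this stream, the following Python sketch splits it into surface forms and lexical forms. It is a simplified reader written for this document, not part of lttoolbox: the class and function names are assumptions, and escaped delimiters inside the text are not handled.

```python
import re
from typing import List, NamedTuple

class LexicalForm(NamedTuple):
    lemma: str
    tags: List[str]

class LexicalUnit(NamedTuple):
    surface: str
    analyses: List[LexicalForm]

UNIT = re.compile(r"\^(.*?)\$")   # text between "^" and "$"
TAG = re.compile(r"<([^>]+)>")    # one grammatical symbol

def parse_stream(stream: str) -> List[LexicalUnit]:
    """Parse analyser output (simplified: escape sequences are ignored)."""
    units = []
    for match in UNIT.finditer(stream):
        # First field is the surface form; the rest are lexical forms.
        surface, *forms = match.group(1).split("/")
        analyses = [LexicalForm(f.split("<", 1)[0], TAG.findall(f)) for f in forms]
        units.append(LexicalUnit(surface, analyses))
    return units

stream = ("^vi/ver<vblex><ifi><1><sg>$ "
          "^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ "
          "^señal/señal<n><f><sg>$")
for unit in parse_stream(stream):
    print(unit.surface, [a.lemma for a in unit.analyses])
```

Text outside the "^...$" delimiters, including bracketed format blocks, is simply skipped by this reader.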

A further example of an inflected multi-word unit comes from Portuguese: tinham saudades ("they missed") is a form of the imperfect indicative tense of the verb ter saudades ("to miss").

2.3. The part-of-speech tagger

As has been shown in the previous example, some surface forms (about 30% in Romance languages) are homographs, ambiguous forms for which the morphological analyser delivers more than one lexical form. The part-of-speech tagger chooses one of them, according to the lexical forms of neighbouring words. When translating between related languages, incorrectly resolved ambiguous surface forms are one of the main sources of errors.

The part-of-speech tagger is trained from a tagger definition file (see section 3.2) and corpus data. The tagger has the options --tagger for tagging (during machine translation), and --train for unsupervised training and --supervised for supervised training (both offline, when building the machine translation system). The result of training is a file containing a hidden Markov model (HMM) which has been obtained on representative source-language texts (using an open-source training program). This file also contains the patterns that define the behaviour of the tagger and the information about the ambiguity classes present during training. Two training modes are possible: the unsupervised one uses a large amount (millions of words) of untagged text processed by the morphological analyser, and the supervised one uses a small amount of tagged text (tens of thousands of words) in which a lexical form for each homograph has been manually selected. The second method usually leads to slightly better performance (about 96% correct part-of-speech tags considering homographs and non-homographs). We are currently building a collection of open corpora (both untagged and tagged) using texts published on the web under Creative Commons licenses.

The result of processing the example text delivered by the morphological analyser with the part-of-speech tagger would be

^ver<vblex><ifi><1><sg>$[ <em>]^un<det><ind><f><sg>$ ^señal<n><f><sg>$[</em>]

where the correct lexical form (determiner) has been selected for the word una.
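
To make the disambiguation step concrete, here is a minimal Python sketch of Viterbi decoding over an HMM whose observations are ambiguity classes (the sets of tags the analyser proposes for each word). It is not the Apertium tagger: the function name, the probability tables and all numeric values are invented for illustration, and a fixed floor probability stands in for proper smoothing.

```python
import math

def viterbi(classes, start, trans, emit, floor=1e-6):
    """Return the most likely tag sequence for a list of ambiguity classes."""
    def lp(table, key):
        # Log-probability with a small floor for unseen events.
        return math.log(table.get(key, floor))

    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {t: (lp(start, t) + lp(emit, (t, classes[0])), [t]) for t in classes[0]}
    for cls in classes[1:]:
        best = {
            t: max(((score + lp(trans, (prev, t)) + lp(emit, (t, cls)), path + [t])
                    for prev, (score, path) in best.items()),
                   key=lambda cand: cand[0])
            for t in cls
        }
    return max(best.values(), key=lambda cand: cand[0])[1]

# Hypothetical parameters for the running example "vi una señal".
classes = [("vblex",), ("det", "vblex"), ("n",)]
start = {"vblex": 0.2}
trans = {("vblex", "det"): 0.4, ("vblex", "vblex"): 0.05,
         ("det", "n"): 0.7, ("vblex", "n"): 0.1}
emit = {("vblex", ("vblex",)): 0.6, ("det", ("det", "vblex")): 0.3,
        ("vblex", ("det", "vblex")): 0.05, ("n", ("n",)): 0.8}
print(viterbi(classes, start, trans, emit))  # ['vblex', 'det', 'n']
```

With these made-up numbers the determiner reading of una wins because a determiner is far more likely than a verb to be followed by a noun.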

2.4. The lexical transfer module

The lexical transfer module, which is implemented inside the lttoolbox library, is called by the structural transfer module (see next section); it reads each source-language lexical form and delivers a corresponding target-language lexical form. The module reads in a binary file compiled from a bilingual dictionary (see section 3.1). The dictionary contains a single equivalent for each source-language entry; that is, no word-sense disambiguation is performed. For some words, multi-word entries are used to safely select the correct equivalent in frequently occurring fixed contexts. This approach has been used with very good results in Traductor Universia and interNOSTRUM.

Each of the lexical forms in the running example would be translated into Catalan as follows:

ver<vblex> ---> veure<vblex>

un<det> ---> un<det>

señal<n><f> ---> senyal<n><m>

where the remaining grammatical symbols for each lexical form would be simply copied to the target-language output. Note the gender change to masculine when translating señal into Catalan senyal.
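
The lookup-and-copy behaviour just described can be sketched in Python as follows. The in-memory dictionary BIDIX and the function lexical_transfer are hypothetical stand-ins for the compiled bilingual dictionary and the lttoolbox routine; matching on the longest tag prefix is an assumption of this sketch.

```python
# Hypothetical bilingual entries: (lemma, leading tags) -> (lemma, leading tags).
BIDIX = {
    ("ver", ("vblex",)): ("veure", ("vblex",)),
    ("un", ("det",)): ("un", ("det",)),
    ("señal", ("n", "f")): ("senyal", ("n", "m")),
}

def lexical_transfer(lemma, tags):
    """Translate a lexical form, copying the tags not covered by the entry."""
    # Try the longest tag prefix first so more specific entries win.
    for n in range(len(tags), -1, -1):
        entry = BIDIX.get((lemma, tuple(tags[:n])))
        if entry is not None:
            tl_lemma, tl_tags = entry
            return tl_lemma, list(tl_tags) + list(tags[n:])
    raise KeyError(f"no bilingual entry for {lemma!r}")

print(lexical_transfer("señal", ["n", "f", "sg"]))  # ('senyal', ['n', 'm', 'sg'])
```

The number tag sg is not part of the entry for señal, so it is copied unchanged, while the gender tag is replaced by the one stored in the dictionary.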

2.5. The structural transfer module

From the release of Apertium 1.0 on, a generic structural transfer module interprets a slightly preprocessed version of the structural transfer specification file (see 3.3); it uses finite-state pattern matching to detect (in the usual left-to-right, longest-match way) fixed-length patterns of lexical forms (chunks or phrases) needing special processing due to grammatical divergences between the two languages (gender and number changes to ensure agreement in the target language, word reorderings, lexical changes such as changes in prepositions, etc.) and performs the corresponding transformations.

Optionally, the module may be compiled from the structural transfer specification file to slightly increase translation speed; in this case, however, each language pair has its own structural transfer module (this was the usual situation until the release of Apertium 1.0).

In the running example, a determiner-noun rule is used to change the gender of the determiner so that it agrees with the noun; the result is

^veure<vblex><ifi><1><sg>$[ <em>]^un<det><ind><m><sg>$ ^senyal<n><m><sg>$[</em>]
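
A determiner-noun agreement rule of this kind can be sketched in Python. Real rules are written in the XML format of section 3.3 and matched by finite-state machinery; here the pattern is hard-coded, and the names det_noun_agreement, pick and substitute are assumptions introduced only for this example.

```python
GENDERS = {"m", "f"}
NUMBERS = {"sg", "pl"}

def pick(tags, values):
    """Return the first tag belonging to `values`, or None if there is none."""
    return next((t for t in tags if t in values), None)

def substitute(tags, values, new):
    """Replace every tag belonging to `values` by `new` (when `new` is set)."""
    return [new if (new and t in values) else t for t in tags]

def det_noun_agreement(units):
    """Scan left to right; on a det + n pattern, copy gender and number to the det."""
    out = [(lemma, list(tags)) for lemma, tags in units]
    i = 0
    while i + 1 < len(out):
        (det, det_tags), (_, noun_tags) = out[i], out[i + 1]
        if det_tags[:1] == ["det"] and noun_tags[:1] == ["n"]:
            det_tags = substitute(det_tags, GENDERS, pick(noun_tags, GENDERS))
            det_tags = substitute(det_tags, NUMBERS, pick(noun_tags, NUMBERS))
            out[i] = (det, det_tags)
            i += 2  # the matched pattern is consumed as a whole
        else:
            i += 1
    return out

units = [("veure", ["vblex", "ifi", "1", "sg"]),
         ("un", ["det", "ind", "f", "sg"]),
         ("senyal", ["n", "m", "sg"])]
print(det_noun_agreement(units)[1])  # ('un', ['det', 'ind', 'm', 'sg'])
```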

2.6. The morphological generator

The morphological generator (program lt-proc in package lttoolbox with option -g) delivers a target-language surface form for each target-language lexical form, by suitably inflecting it. The module reads in a binary file compiled from a target-language morphological dictionary (see section 3.1). The result for the running example would be

vaig veure[ <em>]un senyal[</em>]

2.7. The post-generator

The post-generator (program lt-proc in package lttoolbox with option -p) performs orthographical operations such as contractions and apostrophations. The module reads in a binary file compiled from a rule file expressed as a dictionary (section 3.1). The post-generator is usually dormant (it just copies the input to the output) until a special alarm symbol contained in some target-language surface forms wakes it up to perform a particular string transformation if necessary; then it goes back to sleep.

For example, in Catalan, clitic pronouns in contact may change before a verb: em ("to me") and ho ("it") contract into m'ho, em and els ("them") contract into me'ls and em and la ("her") are written me la. To signal these changes, linguists prepend an alarm to the target-language surface form "em" in target-language dictionaries and write post-generation rules to ensure the changes described.
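
The dormant/awake behaviour can be imitated with a few lines of Python. This is only a sketch: the alarm character "~", the rule table and the function name post_generate are assumptions (the text above does not name the alarm symbol), and real post-generation rules are compiled into a transducer rather than applied with regular expressions.

```python
import re

ALARM = "~"  # assumed alarm symbol, chosen for this example only
RULES = {
    ("em", "ho"): "m'ho",
    ("em", "els"): "me'ls",
    ("em", "la"): "me la",
}

def post_generate(text: str) -> str:
    """Copy text unchanged except where an alarm-marked word triggers a rule."""
    def wake_up(match):
        first, second = match.group(1), match.group(2)
        # Apply a rule if one exists; otherwise restore the plain words.
        return RULES.get((first, second), f"{first} {second}")

    woken = re.sub(re.escape(ALARM) + r"(\w+) (\w+)", wake_up, text)
    # Remove any alarm that had no following word to combine with.
    return woken.replace(ALARM, "")

print(post_generate("~em ho dius"))  # m'ho dius
```

Text without the alarm symbol passes through untouched, which mirrors the "dormant" state described above.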

2.8. The re-formatter

Finally, the re-formatter restores the format information encapsulated by the de-formatter into the translated text and removes the encapsulation sequences used to protect certain characters in the source text. The result for the running example would be the correct translation of the HTML text:

vaig veure <em>un senyal</em>
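
For symmetry with the de-formatter, the restoration step might be sketched in Python as below. The function name reformat and the two regular expressions are assumptions of this sketch; the actual re-formatters are generated from the format management files described in section 3.4.

```python
import re

def reformat(text: str) -> str:
    """Toy re-formatter (hypothetical): undo bracket encapsulation and escaping."""
    # Remove unescaped encapsulation brackets but keep what they enclose.
    unwrapped = re.sub(r"(?<!\\)[\[\]]", "", text)
    # Drop the escape symbol that protected literal brackets in the source text.
    return re.sub(r"\\([\[\]])", r"\1", unwrapped)

print(reformat("vaig veure[ <em>]un senyal[</em>]"))  # vaig veure <em>un senyal</em>
```

The sample input assumes the source text carried an emphasis tag; any other encapsulated format information would be restored in the same way.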

3. Formats for linguistic data

Adequate documentation of the code and auxiliary files is crucial to the success of open-source software. In the case of an MT system, this implies carefully defining a systematic format for each source of linguistic data used by the system. The formats used by this architecture are based on XML (World Wide Web Consortium, 2004) for interoperability; in particular, for easier parsing, transformation, and maintenance.

The XML formats for each type of linguistic data are defined through conveniently designed XML document-type definitions (DTDs) which may be found inside the apertium package (available through www.apertium.org). On the one hand, the success of the open-source MT engine heavily depends on the acceptance of these formats by other groups; this is indeed the mechanism by which de facto standards appear. Acceptance may be eased by the use of an interoperable XML-based format which, as mentioned, simplifies the transformation of data from and towards it, and also by the availability of tools to manage linguistic data in these formats; the current project is expected to produce transformation and management tools in a later phase. On the other hand, acceptance of the formats also depends on the success of the translation engine itself.

3.1. Dictionaries (lexical processing)

Monolingual morphological dictionaries, bilingual dictionaries and post-generation dictionaries use a common format, defined by DTD dix.dtd in package apertium.

Morphological dictionaries establish the correspondences between surface forms and lexical forms and contain (a) a definition of the alphabet (used by the tokenizer), (b) a section defining the grammatical symbols used in a particular application to specify lexical forms (symbols representing concepts such as noun, verb, plural, present, feminine, etc.), (c) a section defining paradigms (describing reusable groups of correspondences between parts of surface forms and parts of lexical forms), and (d) one or more labelled dictionary sections containing lists of surface form-lexical form correspondences for whole lexical units (including contiguous multi-word units). Paradigms may be used directly in the dictionary sections or to build larger paradigms (at the conceptual level, paradigms represent the regularities in the inflective system of the corresponding language).
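
The role of paradigms can be illustrated with a small Python sketch that expands stem-plus-paradigm entries into full surface form/lexical form correspondences. The data layout, the paradigm name "cas/a__n" and the function expand are invented for this example; the real dictionaries are XML files conforming to dix.dtd and are compiled into transducers rather than expanded into tables.

```python
# Hypothetical paradigm: pairs of (surface ending, lexical ending with symbols).
PARADIGMS = {
    "cas/a__n": [("a", "a<n><f><sg>"), ("as", "a<n><f><pl>")],
}

# Hypothetical dictionary section: each entry is a stem plus a paradigm name.
ENTRIES = [("cas", "cas/a__n"), ("mes", "cas/a__n")]

def expand(entries, paradigms):
    """Build a surface form -> list of lexical forms table from the entries."""
    table = {}
    for stem, name in entries:
        for surface_end, lexical_end in paradigms[name]:
            table.setdefault(stem + surface_end, []).append(stem + lexical_end)
    return table

for surface, lexical_forms in expand(ENTRIES, PARADIGMS).items():
    print(surface, lexical_forms)
```

Adding a new regular noun then costs one line in the entry list, which is the kind of reuse paradigms are meant to provide.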

Bilingual dictionaries have a very similar structure and establish correspondences between source-language lexical forms and target-language lexical forms, but seldom use paradigms.

Finally, post-generation dictionaries are used to establish correspondences between input and output strings corresponding to the orthographical transformations to be performed by the post-generator on the target-language surface forms generated by the generator.

3.2. Tagger definition

Source-language lexical forms delivered by the morphological analyser are defined in terms of fine part-of-speech tags (for example, the word cantábamos (es, "we sang") has lemma cantar ("sing"), category verb, and the following inflection information: indicative, imperfect, 1st person, plural), which are necessary in some parts of the MT engine (structural transfer, morphological generation); however, for the purpose of efficient disambiguation, these fine part-of-speech tags may be grouped in coarser part-of-speech tags (such as verb in personal form).

The tagger definition file is also an XML file (the corresponding DTD, tagger.dtd, may also be found in the apertium package) where (a) coarser tags are defined in terms of fine tags, both for single-word and for multi-word units, (b) constraints may be defined to forbid or enforce certain sequences of part-of-speech tags, and (c) priority lists are used to decide which fine part-of-speech tag to pass on to the structural transfer module when the coarse part-of-speech tag contains more than one fine tag. The tagger definition file is used to define the behaviour of the part-of-speech tagger both when it is being trained on a source-language corpus and when it is running as part of the MT system.
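
The grouping of fine tags into coarse tags, item (a) above, can be pictured with the following Python sketch. The coarse tag names, the prefix-matching rule and the function coarse_tag are assumptions made for illustration; the actual definitions live in the XML tagger definition file.

```python
# Hypothetical coarse tag definitions: (coarse name, required fine-tag prefix).
COARSE_TAGS = [
    ("DET", ("det",)),
    ("NOUN", ("n",)),
    ("VERB_PERSONAL", ("vblex", "ifi")),
    ("VERB_PERSONAL", ("vblex", "prs")),
]

def coarse_tag(fine_tags):
    """Map a sequence of fine tags to the first coarse tag whose prefix matches."""
    for name, prefix in COARSE_TAGS:
        if tuple(fine_tags[:len(prefix)]) == prefix:
            return name
    return "OTHER"  # fallback label used only in this sketch

print(coarse_tag(["vblex", "ifi", "1", "sg"]))  # VERB_PERSONAL
print(coarse_tag(["det", "ind", "f", "sg"]))    # DET
```

Person and number are ignored by the coarse tag, so the HMM has far fewer states to estimate than it would with the full set of fine tags.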

3.3. Structural transfer

An XML format for shallow structural transfer rules has also been established; a commented DTD (transfer.dtd) may be found inside the apertium package.

Structural transfer rule files contain pattern-action rules which describe what has to be done for each pattern (much as in languages such as perl or lex). Patterns are defined in terms of categories which are in turn defined (in the preamble) in terms of fine morphological tags and, optionally, lemmas for lexicalized rules. For example, a commonly used pattern, determiner-noun, has an associated action which sets the gender and number of the determiner to those of the noun to ensure gender and number agreement.

Using a declarative notation such as XML is rather straightforward for the pattern part of rules, but using it for the action (procedural) part means stretching it a bit; we have, however, found a reasonable way to express linguistic transformations in XML. In this way, we follow as closely as possible the declarative approach used in the XML files defining the linguistic data used for the tagger and for the lexical processing modules.

3.4. De-formatter and re-formatter

De-formatters and re-formatters are generated from format management files specified by the DTD format.dtd in package apertium. These are not linguistic data but are considered in this section for convenience. Format management files for RTF (format-rtf.xml), HTML (format-html.xml) and plain ISO-8859-1 text (format-txt.xml) are provided in package apertium. Scripts apertium-gen-deformat and apertium-gen-reformat in the apertium package generate C++ de-formatters and re-formatters respectively for each format using lex as an intermediate representation.

4. Compilers and preprocessors

The Apertium toolbox contains compilers to convert the linguistic data into the corresponding efficient form used by the modules of the engine. Two main compilers are used in this project: one for the four lexical processing modules of the system and another one for the structural transfer.

4.1. Lexical processing

The lexical processor compiler (lt-comp in package lttoolbox) is very fast (it takes about a minute to compile the current dictionaries in the system) thanks to the use of advanced transducer building strategies and to the minimization of partial finite-state transducers (Roche & Schabes 1997) during construction. This makes linguistic data development much easier, because the effect on the whole system of changing a rule or a lexical item may be tested almost immediately.

The four lexical processing modules (morphological analyser, lexical transfer, morphological generator, post-generator) are implemented as a single program (lt-proc in package lttoolbox) which reads binary files containing a compact and efficient representation of a class of finite-state transducers (in particular, augmented letter transducers, Garrido-Alenda et al. 2002).

4.2. Structural transfer

The current structural transfer preprocessor (file apertium-preprocess-transfer in package apertium) reads in a structural transfer rule file (see section 3.3) and generates a file containing precompiled patterns and indexes to the actions of the rules in the structural transfer specification.

As mentioned in section 2.5, structural transfer rules for a given language pair may also be compiled into a specific structural transfer module, if a slight increase in translation speed is desired (this was the default until the release of Apertium 1.0).

5. Concluding remarks

This document describes Apertium: an open-source shallow-transfer machine translation engine for related-language pairs, developed in a large, government-funded open-source development project. It may be adapted to translating between Romance languages of Europe (French, Portuguese, Italian, Occitan, etc.), between related European language pairs outside the Romance group (Danish-Swedish, Czech-Slovak, etc.), or even between other related languages (for instance, Kirwanda-Swahili).

The Apertium shallow-transfer engine has not been designed from scratch but may rather be seen as a complete open-source rewriting of an existing engine (interNOSTRUM, Canals-Marote et al. 2001; Traductor Universia, Garrido-Alenda et al. 2003) which is currently used daily by thousands of people through the net, and the corresponding redesign of linguistic data formats and rewriting of compilers.

The code (in two packages, lttoolbox and apertium), together with pilot Spanish-Catalan (package apertium-es-ca), Spanish-Galician (apertium-es-gl), and Spanish-Portuguese (apertium-es-pt) linguistic data is available through http://sourceforge.net/projects/apertium/.

Acknowledgements: This work has been funded through project FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism), with support from project TIC2003-08681-C02-01 (Spanish Ministry of Science and Technology). Felipe Sánchez-Martínez is supported by the Spanish Ministry of Science and Education and the European Social Fund through grant BES-2004-4711.

6. References

Canals-Marote, R., A. Esteve-Guillén, A. Garrido-Alenda, M.I. Guardiola-Savall, A. Iturraspe-Bellver, S. Montserrat-Buendia, S. Ortiz-Rojas, H. Pastor-Pina, P.M. Pérez-Antón, M.L. Forcada (2001). "The Spanish-Catalan machine translation system interNOSTRUM", in B. Maegaard, ed., Proceedings of MT Summit VIII: Machine Translation in the Information Age, 73-76.
Carreras, X., I. Chao, L. Padró and M. Padró (2004). "FreeLing: An Open-Source Suite of Language Analyzers", in M.T. Lino, M.F. Xavier, F. Ferreira, R. Costa, R. Silva, eds., Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04). Lisbon, Portugal.
Díaz de Ilarraza, A., A. Mayor, K. Sarasola (2000). "Reusability of wide-coverage linguistic resources in the construction of a multilingual machine translation system", in Lewis, D., Mitkov, R., eds., Proceedings of MT 2000 (Univ. of Exeter, UK, 19-22 Nov. 2000).
Garrido, A., Amaia Iturraspe, Sandra Montserrat, Hermínia Pastor, Mikel L. Forcada (1999). "A compiler for morphological analysers and generators based on finite-state transducers", Procesamiento del Lenguaje Natural, 25, 93-98.
Garrido-Alenda, A., M.L. Forcada (2001). "MorphTrans: un lenguaje y un compilador para especificar y generar módulos de transferencia morfológica para sistemas de traducción automática", Procesamiento del Lenguaje Natural, 27, 157-162.
Garrido-Alenda, A., Mikel L. Forcada, Rafael C. Carrasco (2002). "Incremental construction and maintenance of morphological analysers based on augmented letter transducers", in Mitamura, T., Nyberg, E., eds., Proceedings of TMI 2002 (Theoretical and Methodological Issues in Machine Translation, Keihanna/Kyoto, Japan, March 2002), 53-62.
Garrido-Alenda, A., Patrícia Gilabert Zarco, Juan Antonio Pérez Ortiz, Antonio Pertusa-Ibáñez, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Míriam A. Scalco, Mikel L. Forcada (2004). "Shallow parsing for Portuguese-Spanish Machine Translation", in Branco, A., Mendes, A., Ribeiro, R., eds., Language technology for Portuguese: shallow processing tools and resources, 135-144.
Roche, E., Schabes, Y. (1997). "Introduction", in Roche, E., Schabes, Y., eds., Finite-state language processing, 1-65.
World Wide Web Consortium (2004). "Extensible Markup Language (XML)", http://www.w3.org/XML/.

Figure 1: Block diagram of the Apertium machine translation system

+---------+       +-----------------+
| SL text |  ---> | de-formatter    |
+---------+       +-----------------+
                           |
                           V
                  +-----------------+
                  | morph. analyser |
                  +-----------------+
                           |
                           V
                  +-----------------+
                  | part-of-speech  |
                  |    tagger       |
                  +-----------------+
                           |
                           V
                  +-----------------+       +-----------+
                  |   structural    |  <->  | lexical   |
                  |    transfer     |       | transfer  |
                  +-----------------+       +-----------+
                           |
                           V
                  +-----------------+
                  |  morphological  |
                  |    generator    |
                  +-----------------+
                           |
                           V
                  +-----------------+
                  | post-generator  |
                  +-----------------+
                           |
                           V
                  +-----------------+       +---------+
                  |  re-formatter   |  -->  | TL text |
                  +-----------------+       +---------+