A major theme at this year's edition of
IWSLT is the sharing of linguistic resources and tools among the participants to
make the evaluation more collaborative and fair. To this end, we ask that each
participant send us information about non-proprietary resources used in the
development of this year's submission so that other groups may also utilize these
resources for the various tasks. The deadline for submission (see the schedule page) is intended to give other participants time
to use the resources if they would like. Additional resources not included in data sets provided by IWSLT partners will be placed on this page.
Note, however, that participants do not have to provide resources directly, nor are they required to provide resources that they acquired elsewhere and then modified in some way (e.g. cleaned, corrected, or enhanced). In that case, a group would provide a reference and/or link to the original provider or creator.
Some examples of resources that can be used include:
- Publicly available aligned or monolingual corpora such as EuroParl or LDC data (see below). Some of these resources may carry licensing fees, but the fees should be "reasonable" and affordable by most research groups.
- Publicly available annotated treebanks.
Some examples of resources that can NOT be used include:
- Privately developed linguistic resources and/or corpora
- NIST or LDC data which require participation in an evaluation campaign. Examples include data available for GALE or TREC (i.e. resources with LDC catalog codes such as "LDCyyyyExx" or "LDCyyyyGxx").
- Publicly available linguistic resources which require high licensing fees.
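For groups checking a long list of LDC catalog codes against the restriction above, the pattern can be sketched as a small filter. This is only an illustration, not an official tool: it assumes that "yyyy" stands for a four-digit year and "xx" for a two-digit number, and the function name is hypothetical. The code "LDC2006E12" used in the usage note is a made-up example.

```python
import re

# Assumed reading of the placeholders on this page:
# "yyyy" = four-digit year, "xx" = two-digit number,
# and the letter E or G marks the evaluation-only series.
RESTRICTED_LDC_CODE = re.compile(r"LDC\d{4}[EG]\d{2}")


def is_restricted_ldc_code(code: str) -> bool:
    """Return True if the code has the form LDCyyyyExx or LDCyyyyGxx."""
    return RESTRICTED_LDC_CODE.fullmatch(code.strip()) is not None
```

Under these assumptions, is_restricted_ldc_code("LDC2006E12") is true, while a code such as "LDC2002L49" is not flagged. A match only indicates that the code has the restricted form; the licensing terms of the resource itself remain the authority.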
If you have any interesting resources
that you would like to share with other participants or questions
concerning resources, you can send them to Cam Fordyce at info AT celct DOT it and put "[IWSLT07 Resources]" in the subject line.
Currently Shared Resources
The following list includes resources provided by the organizers and participants. A few of the links are files. The web links have not been verified. If you find broken links, please let us know and we will do our best to resolve the problem. The list below was updated on 26 June, 2007.
- The open source decoder, Moses, for SMT.
- Tool for building translation models, Giza++.
- Shell scripts for creating NBEST and 1BEST lists from the IWSLT07
lattices. The archive also includes a Perl script used to pre-process
the English reference files (development and test) and a small script for fixing a character encoding error in the Italian development data file, dev5b: Tool Archive.
- SRILM, a toolkit for building and applying statistical language models (LMs).
- Champollion, a sentence aligner, which also contains a set of tools such as an English stemmer and a Chinese segmenter.
- Buckwalter Arabic Morphological Analyzer Version 1.0: LDC2002L49.
- OpenNLP Toolkit.
- SilkRoad, a phrase-based statistical machine translation system, which consists of a set of utilities such as a phrase extractor, three decoders, and some preprocessing tools.
- Tokenizer for Penn Treebank.
- Steven Abney's Cass chunker.
- Adwait Ratnaparkhi's Maximum Entropy POS Tagger.
- Dekang Lin's NLP tools and corpora.
- Mona Talat Diab's Arabic lemmatizer/chunker.
- Masao Utiyama's NLP/MT software.
- Kyoto University's NLP tools.
- ChaSen morphological analyzer.
- Taku Kudo's NLP tools for Japanese.
- ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).
- Stanford University's Parser for Chinese
- Universitat Politècnica de Catalunya (UPC) MARIE Ngram SMT Decoder
- Arabic Morphological Analyzer MADA+TOKAN from Columbia University: contact the authors (Owen Rambow and Nizar Habash). See Nizar's website.
- USC ISI's Finite State Toolkit