Showing posts with label Neural Machine Translation.

Friday, September 8, 2023

MemoQ Unsigned OPUSCAT Plug-in Warning

To get rid of the memoQ unsigned plug-in warning that OPUS-CAT triggers, you need to create a ClientDevConfig.xml file under %ProgramData%\MemoQ containing just the following code:

<?xml version="1.0" encoding="utf-8"?>
<ClientDevConfig>
  <LoadUnsignedPlugins>true</LoadUnsignedPlugins>
</ClientDevConfig>
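
If you prefer to script this, here is a minimal Python sketch that writes the file; the MemoQ folder name under %ProgramData% is taken from the instructions above, and you need sufficient rights to write there:

import os
from pathlib import Path

CONFIG = """<?xml version="1.0" encoding="utf-8"?>
<ClientDevConfig>
  <LoadUnsignedPlugins>true</LoadUnsignedPlugins>
</ClientDevConfig>
"""

# %ProgramData% normally resolves to C:\ProgramData on Windows.
target = Path(os.environ["ProgramData"]) / "MemoQ"
target.mkdir(parents=True, exist_ok=True)
(target / "ClientDevConfig.xml").write_text(CONFIG, encoding="utf-8")
print("Wrote", target / "ClientDevConfig.xml")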

Source: https://github.com/Helsinki-NLP/OPUS-CAT

From the OPUS-CAT repository instructions:

Installation:

  1. Download and unpack the zip file.
  2. Copy the ClientDevConfig.xml file to your %programdata%\MemoQ folder. %ProgramData% usually resolves to C:\ProgramData (not AppData), but note that it is a hidden folder, so you may need to enable "Show hidden items" in the Windows Explorer settings.
  3. Copy the .dll file to the Addins folder of your memoQ installation (e.g., C:\Program Files\memoQ\memoQ-9\Addins). Make sure to delete any previous plugin files.

Troubleshooting:

  • If an error message regarding CAS policy is displayed when memoQ starts, unzip the plugin archive with third-party software (such as WinRAR) instead of Windows Explorer. Windows flags .dll files that have been downloaded on their own or inside a .zip file that is unpacked using Windows Explorer.
  • If you notice incompatibility issues or other errors, please open an issue under this repository.

Saturday, June 3, 2023

ChatGPT for Translation

ChatGPT's interactive nature makes it a standout translation tool. With other translation tools, you provide a text, you get a translation, and that's it. Whether it is the best translation you can get doesn't matter; you're stuck with it.

With ChatGPT, you can customize translations to suit your specific needs and provide feedback on adjustments you'd like to see. For example, you can adjust the tone and style and take into account cultural connotations and regional differences in the meaning of words, something purpose-built translation tools like Google Translate cannot do.
All you need to do is provide the text you want to translate, specify the language you want to translate it to, and ChatGPT will handle the rest.

1. Provide Context

One of the key advantages of ChatGPT over popular translation tools like Google Translate is its ability to accurately consider the context of a text when generating translations. Considering context can be the difference between simply translating individual words in a sentence and generating a translation that truly reflects the author's or speaker's intention.

Take the Spanish sentence “Gracias por preguntar, pero estoy bastante seguro aquí” for instance. Google Translate produces "Thanks for asking, but I'm pretty sure here" as the translation. While this isn't entirely wrong, depending on the context, the sentence could mean, "Thanks for asking, but I'm safe here."
Of course, Google Translate will provide the same translation no matter how many times you attempt to translate it, because it doesn't have a way to recognize contextual nuance. ChatGPT, by contrast, will attempt to provide the most accurate translation for the context you provide. Providing context can significantly improve the quality of your translation. If you are not sure how to provide context, here are some examples for inspiration:

    "Translate [text to translate in Filipino] to English from the perspective of a native Filipino speaker" should try to maintain as many cultural connotations as possible in a translation.
    "Translate [text to translate] to English from the perspective of someone discussing the COVID-19 pandemic" should use appropriate medical terms instead of generic words.
    "Translate [text to translate] to English. The text discusses a battle during WWII" should use appropriate military and historical terms.

2. Declare the Type of Text

Another important factor that can increase the accuracy of your translation is outright declaring the kind of text you're trying to translate. For example, is it an idiom, a song, a financial document, or an ordinary text? Simply letting ChatGPT know what you're trying to translate gives the chatbot an edge toward providing more accurate translations.

Instead of simply using a prompt like "Translate [text to translate] to [target language]," you should ideally use alternatives like:

    Translate the [financial report | poem | song | Bible portion | proverb] in quotes to [target language]
    Translate [text to translate] to [target language]. The text to be translated is a [military report | medical document | drug prescription]

The prompts above or similar ones help ChatGPT use relevant or industry-specific context when generating a translation. Although ChatGPT sometimes recognizes the right niche words to use for translation, in some cases you'll have to explicitly prompt it to do so using a type declaration.

3. Use Style Transfer

Sometimes, when translating text, the translation might be too technical or simply inappropriate for the target audience. Using style transfer in ChatGPT can help adjust the tone and style of a translation to match the target audience or industry. So, if you're translating a legal document, the translation could retain the author's original meaning while using more layman's wording. In the example below, I translated a soccer commentary from Spanish to English, first without style transfer and then using style transfer.
The first translation uses the closest English equivalent of each Spanish word, while the second uses wording suitable for an audience not acquainted with soccer terms. Interestingly, both translations are considered accurate.
An English translation using style transfer

To use style transfer when translating, use prompts like:

    Translate [text to translate] to [target language] in layman's terms.
    Translate [text to translate] to [target language] for a [grade 5] audience
    Translate [text to translate] to [target language]. Use style transfer to make the translated text suitable for a [target audience]

4. Account for Regional Differences

Some words may have different meanings or connotations depending on the region or country of the speaker. For instance, the English sentence "I'm going to play football" could translate to "我要去踢足球 (Wǒ yào qù tī zúqiú)" in Chinese. While this seems like a perfect translation, if the speaker were American, the translation could be wrong. By saying "football," an American speaker would likely be referring to the rugby-style sport called American football instead of the football known to the rest of the world.

Regular translation tools have no way to account for this potential misinterpretation. ChatGPT, on the other hand, can provide varying translations depending on the speaker's origin.

We prompted ChatGPT to translate "I'm going to play football" into Chinese. As expected, it produced "我要去踢足球 (Wǒ yào qù tī zúqiú)." In Chinese, "zúqiú" means "football," which refers to soccer rather than the rugby-style sport.
ChatGPT translation accounting for regional differences in meaning

We repeated the translation prompt but added hints about the speaker's origin and possible intent. ChatGPT changed the translation to "我要去踢橄榄球 (Wǒ yào qù tī gǎnlǎnqiú)", this time using "gǎnlǎnqiú", the Chinese term for American football, which better reflects the speaker's likely intent.

5. Use Summarized Translation

Sometimes, you don't want to read the entirety of a text. You just want to understand the message the author or speaker is trying to convey. ChatGPT is one of the few translation tools you can rely on in situations like this. To get a summarized translation, ask ChatGPT to provide a "summarized" or "condensed" translation of the target text. Some example prompts include:

    Provide a descriptive but condensed translation of [text to translate] in Spanish.
    Provide a summarized translation of [text to translate] in French.
    Provide a summarized translation of [text to translate] in English.
    Translate this article into Dutch, but only include the key points.

6. Use a Fine-Tuned Instance of ChatGPT

Using a fine-tuned instance of ChatGPT is one of the best ways to utilize ChatGPT as a translation tool. It opens up almost endless possibilities for translation using the AI chatbot. But how can you fine-tune ChatGPT for translation? (Here, "fine-tuning" loosely means priming the conversation with rules and examples, not retraining the model.)

You can do it in several ways. A key component to fine-tuning ChatGPT for translation is laying out rules the chatbot must follow when translating any text you provide. For instance, you can fine-tune ChatGPT by providing a word-translation pair or a text-translation pair. Here's an example below:

While trying to translate a Pidgin text into English, we ran into some wrongly translated words. Providing corrected word-translation pairs made ChatGPT update its translation of those words in subsequent translations.
Fine-tuning ChatGPT for translation tasks

You can also make ChatGPT translations more accurate by providing several large texts and their verified translations. You can then prompt ChatGPT to deduce the right translation of words and phrases from the provided samples and apply it when translating text involving a similar language pair. While you can use significantly longer texts to fine-tune ChatGPT translations, below is a short illustration of how it works using a short paragraph.
Providing a parallel corpus of text for ChatGPT
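
As a sketch of the same idea in code, the correction pairs can be built into every request so they persist across translations; the glossary entries are illustrative, and the SDK usage matches the earlier sketch in this post:

from openai import OpenAI

client = OpenAI()

# Illustrative corrections for a Pidgin-to-English job.
GLOSSARY = {
    "wahala": "trouble",
    "abeg": "please",
}

def translate_with_glossary(text, target="English"):
    rules = "\n".join(f'- Always translate "{src}" as "{dst}".'
                      for src, dst in GLOSSARY.items())
    prompt = (f"Follow these translation rules:\n{rules}\n\n"
              f"Translate the following text to {target}:\n{text}")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content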

We achieved improved translation with every prompt without taking any further steps.

Don't Rely Solely on Machine Translation

While ChatGPT is an impressive translation tool, it's important to remember that it is still a machine and may not always produce the best translation. So don't rely solely on it, especially for important or sensitive documents. Instead, try a combination of tools, and whenever possible, have a professional translator proofread the output to ensure accuracy.

Source: makeuseof.com

Other resources: https://chatdico.com/

Sunday, December 27, 2020

Open Source Machine Translation Systems

 

NMT

System | Team | Description | Link | Framework
Tensor2Tensor | Google Brain | Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research. | https://github.com/tensorflow/tensor2tensor | TensorFlow
Fairseq | Facebook Research | Facebook AI Research Sequence-to-Sequence Toolkit written in Python. | https://github.com/pytorch/fairseq | PyTorch
facebookresearch/fairseq | Facebook Research | Facebook AI Research Sequence-to-Sequence Toolkit (original Lua/Torch version). | https://github.com/facebookresearch/fairseq | Lua
tensorflow/nmt | Google Brain | TensorFlow Neural Machine Translation Tutorial. | https://github.com/tensorflow/nmt | TensorFlow
OpenNMT-tf | OpenNMT | Neural machine translation and sequence learning using TensorFlow. | https://github.com/OpenNMT/OpenNMT-tf | TensorFlow
OpenNMT-py | OpenNMT | Open-source neural machine translation in PyTorch. | https://github.com/OpenNMT/OpenNMT-py | PyTorch
THUMT | Tsinghua Natural Language Processing Group | Transformer, multi-GPU training and decoding, distributed training. | https://github.com/THUNLP-MT/THUMT | TensorFlow/Theano
NiuTrans.NMT | NiuTrans | Transformer and FFN-LM based on NiuTrans.Tensor, by the NiuTrans team. | https://github.com/NiuTrans/NiuTrans.Tensor | C/C++
Marian NMT | Adam Mickiewicz University and others | Pure C++ with minimal dependencies; one engine for GPU/CPU training and decoding. | https://marian-nmt.github.io/ | C++
Seq2Seq | Denny Britz, Anna Goldie, Thang Luong, Quoc Le | A general-purpose encoder-decoder framework for TensorFlow. | https://github.com/google/seq2seq | TensorFlow
Nematus | The Natural Language Processing Group at the University of Edinburgh | Support for RNN and Transformer architectures, multi-GPU support, server mode. | https://github.com/EdinburghNLP/nematus | TensorFlow
Sockeye | AWS Labs | A sequence-to-sequence framework for neural machine translation. | https://awslabs.github.io/sockeye/ | MXNet
CytonMT | Xiaolin Wang, Masao Utiyama, Eiichiro Sumita | An efficient neural machine translation open-source toolkit implemented in C++. | https://github.com/arthurxlw/cytonMt | C++
OpenSeq2Seq | NVIDIA | Modular architecture, support for mixed-precision training, fast Horovod-based distributed training. | https://nvidia.github.io/OpenSeq2Seq/html/index.html | TensorFlow
nmtpytorch | The Language and Speech Team of Le Mans University | Various end-to-end neural architectures. | https://github.com/lium-lst/nmtpytorch | PyTorch
DL4MT | Cho Lab at NYU CS and CDS | A multi-encoder, multi-decoder, or multi-way NMT model. | https://github.com/nyu-dl/dl4mt-multi | Theano
ModernMT | Marco Trombetti, Davide Caroselli, Nicola Bertoldi | A context-aware, incremental, and distributed general-purpose neural machine translation technology based on the Fairseq Transformer model. | https://github.com/ModernMT/MMT | PyTorch
UnsupervisedMT | Facebook Research | Seq2seq, biLSTM + attention, Transformer; ability to share an arbitrary number of parameters; denoising auto-encoder training. | https://github.com/facebookresearch/UnsupervisedMT | PyTorch

SMT

System | Team | Description | Link | Framework
Moses | moses-smt | A free-software statistical machine translation engine that can be used to train statistical models of text translation from a source language to a target language. | http://www.statmt.org/moses/ | C++
GIZA++ | moses-smt | An SMT toolkit used to train IBM Models 1-5 and an HMM word alignment model. | https://github.com/moses-smt/giza-pp | C++
NiuTrans.SMT | NiuTrans | An open-source statistical machine translation system developed by a joint team from the NLP Lab at Northeastern University and the NiuTrans team, fully developed in C++. | https://github.com/NiuTrans/NiuTrans.SMT | C/C++
UCAM-SMT | The MT group in Cambridge | The Cambridge statistical machine translation system. | http://ucam-smt.github.io/ | C++
Jane | RWTH Aachen University | Supports state-of-the-art techniques for phrase-based and hierarchical phrase-based machine translation. | http://www-i6.informatik.rwth-aachen.de/jane/ | C++
Phrasal | Stanford NLP Group | A state-of-the-art statistical phrase-based machine translation system. | https://nlp.stanford.edu/phrasal/ | Java
cdec | The Language Technologies Institute at Carnegie Mellon University | A decoder, aligner, and learning framework for SMT and similar structured prediction models. | http://www.cdec-decoder.org/ | C++
Joshua | Juri Ganitkevitch and Matt Post | An SMT decoder for phrase-based, hierarchical, and syntax-based machine translation. | https://cwiki.apache.org/confluence/display/JOSHUA/ | Java
Source: https://github.com/NiuTrans/MT-paper-lists

Sunday, February 16, 2020

Autohotkey for Google Translate

^!r::Reload ; Assign Ctrl-Alt-R as a hotkey to restart the script.
#x:: ; Win+X - Translate the current segment in memoQ, Across, or Trados Studio with Google Translate
FileEncoding, utf-8
WinGetActiveTitle, Title
IfInString, Title, memoQ
{
Send ^+s
Sleep, 200
SendInput ^{F8}
Sleep, 300
SendInput ^a
Sleep, 300
SendInput ^c
Clipwait
Send ^c
Sleep, 300
Clipwait
Sleep, 200
}
IfInString, Title, Across
{
Send !{PgDn}
Sleep, 200
SendInput ^a
Sleep, 300
}
IfInString, Title, Trados Studio
{
Sleep, 300
Send ^{Ins}
Sleep, 200
SendInput ^a
Sleep, 300
}
SendInput ^c
Clipwait
Send ^c
Sleep, 300
Clipwait
Sleep, 300
Clipwait
Sleep, 300
searchtext := clipboard
;StringReplace, searchtext, searchtext, .%A_SPACE%, ._, All ;StringReplace, searchtext, searchtext, %A_SPACE%., _., All
;StringReplace, searchtext, searchtext, `;%A_SPACE%, `;_, All, ; `%5F - for underscore
;StringReplace, searchtext, searchtext, `.%A_SPACE%, `._, All ;Msgbox %searchtexturlencoded%
searchtexturlencoded := UriEncode(searchtext)
;searchtexturlencoded := URLEncoding(searchtext)
; StringReplace, searchtext, searchtext, %A_SPACE%&%A_SPACE%, `%26, All
; StringReplace, searchtext, searchtext, %A_SPACE%, +, All
Sleep, 300
RunWait, wget.exe -U "Mozilla/5.0 Chrome/62.0.3202.94" "http://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=de&dt=t&q=%searchtexturlencoded%" -O source.out.txt,, Hide
FileRead, targettext, source.out.txt ;Msgbox %targettext%
StringTrimLeft, targettext, targettext, 4
StringTrimRight, targettext, targettext, 73
quotes := ""","""
Loop, 4 ;MsgBox, Iteration number is %A_Index%.
{
addedvalue := (A_Index - 1)
textbreak = "`,null`,null`,%addedvalue%`]`,`[" ;MsgBox Textbreak %textbreak%
IfInString, targettext, %textbreak%
{
StringReplace, targettext, targettext, %textbreak%, ‡, All ;Msgbox targettext . %targettext%
}
}
occurences =
Loop, Parse, targettext, ‡
{
loopingtext = %A_LoopField% ;Msgbox Loopfield: %loopingtext%
StringLen, targetlength, loopingtext
StringGetPos, posghilimele, loopingtext, %quotes% ;MsgBox Poziţie glilimele %posghilimele% Msgbox Target length %targetlength%
StringLen, targetlength, loopingtext
StringTrimRight, occurence, loopingtext, (targetlength - posghilimele)
occurences .= occurence ;Msgbox occurence after add %occurences% ;occurences := occurences . occurence
}
StringReplace, occurences, occurences, ș, ş, All
StringReplace, occurences, occurences, Ș, Ş, All
StringReplace, occurences, occurences, ț, ţ, All
StringReplace, occurences, occurences, Ț, Ţ, All
StringReplace, occurences, occurences, %A_SPACE%._%A_SPACE%, .%A_SPACE%, All
StringReplace, occurences, occurences, `;%A_SPACE%_, `;%A_SPACE%, All
StringReplace, occurences, occurences, `;%A_SPACE%%A_SPACE%, `;%A_SPACE%, All
StringReplace, occurences, occurences, `._, `.%A_SPACE%, All
occurences := UnSlashUnicode(occurences)
Clipboard = %occurences% ;googleout := Clipboard
Send ^v
Reload
; IfInString, Title, memoQ {Send ^{Enter} ;Send ^+{Enter}Send !{Up}}
return

uriDecode(str) { ; decode %XX escape sequences back to characters
    Loop
        If RegExMatch(str, "i)(?<=%)[\da-f]{1,2}", hex)
            StringReplace, str, str, `%%hex%, % Chr("0x" . hex), All
        Else
            Break
    return, str
}

UriEncode(Uri, RE="[0-9A-Za-z]") { ; percent-encode every byte not matching RE
    VarSetCapacity(Var, StrPut(Uri, "UTF-8"), 0), StrPut(Uri, &Var, "UTF-8")
    While Code := NumGet(Var, A_Index - 1, "UChar")
        Res .= (Chr := Chr(Code)) ~= RE ? Chr : Format("%{:02X}", Code)
    return, Res
}

UnSlashUnicode(s) ; unslash unicode sequences like \u0026
{
    rx = \\u([0-9a-fA-F]{4})
    pos = 0
    Loop
    {
        pos := RegExMatch(s, rx, m, pos + 1)
        if (pos = 0)
            break
        StringReplace, s, s, %m%, % Chr("0x" . SubStr(m, 3, 4))
    }
    return, s
}
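
For reference, the unofficial translate_a/single endpoint that the wget call above targets can also be queried from Python. This is a minimal sketch assuming the requests package is installed; the endpoint is undocumented and may change or be rate-limited at any time:

import requests

def google_translate(text, target="de", source="auto"):
    # Same unofficial endpoint as the AutoHotkey script above.
    url = "http://translate.googleapis.com/translate_a/single"
    params = {"client": "gtx", "sl": source, "tl": target, "dt": "t", "q": text}
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    resp.raise_for_status()
    # The response is a nested JSON array; translated segments sit in data[0][i][0].
    return "".join(seg[0] for seg in resp.json()[0] if seg[0])

print(google_translate("Thanks for asking, but I'm safe here."))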

Friday, January 3, 2020

Romanian Translations of the German Federal States + Capitals

State (German) | State (Romanian) | Capital
Baden-Württemberg | Baden-Württemberg | Stuttgart
Freistaat Bayern | Statul Liber Bavaria | München
Berlin | Berlin | Berlin
Brandenburg | Brandenburg | Potsdam
Freie Hansestadt Bremen | Orașul Liber Hanseatic Brema | Bremen (the state comprises Bremen and Bremerhaven)
Freie und Hansestadt Hamburg | Orașul Liber și Hanseatic Hamburg | Hamburg
Hessen | Hessa | Wiesbaden
Mecklenburg-Vorpommern | Mecklenburg-Pomerania Inferioară | Schwerin
Niedersachsen | Saxonia Inferioară | Hannover (Hanovra)
Nordrhein-Westfalen | Renania de Nord-Westfalia | Düsseldorf
Rheinland-Pfalz | Renania-Palatinat | Mainz
Saarland | Saarland | Saarbrücken
Freistaat Sachsen | Statul Liber Saxonia | Dresden (Dresda)
Sachsen-Anhalt | Saxonia-Anhalt | Magdeburg
Schleswig-Holstein | Schleswig-Holstein | Kiel
Freistaat Thüringen | Statul Liber Turingia | Erfurt

Sunday, September 8, 2019

GIZA++ to Obtain Word Alignment Between Bilingual Sentences

Problem

Word alignment is a mapping between words in two sentences that have the same meaning in two different languages. Let's say we have an English and a Spanish sentence:
  I saw a white bird on my way home.
  Vi un pájaro blanco camino a casa.
Then the words 'I saw' <-> 'Vi', 'white' <-> 'blanco', 'bird' <-> 'pájaro', etc. correspond between the two sentences. Notice that words do not correspond one-to-one. For example, 'on my way' in English is translated as 'camino' in Spanish. Word order may also differ across languages, e.g., 'white bird' in English becomes 'pájaro blanco' in Spanish, since Spanish adjectives are placed after nouns. Given a large corpus of bilingual sentences (bitext), we would like to compute this word alignment automatically.

Solution

GIZA++ is a toolkit for training word alignment models. It supports IBM Models 1 to 5, the now-classic but still most widely used unsupervised word alignment models. Let's use bilingual sentences from the Tatoeba project to begin with. We can use the Tatoeba.org preprocessing script to extract bilingual sentences from the sentence and link dumps downloaded from their download page:
python preprocessors/tatoeba/create_bitext.py --languages spa_eng --sentences sentences.csv \
    --links links.csv > tatoeba_es_en.tsv
We extract bilingual sentences in Spanish (source language) and English (target language). In all the examples in this article, replace the file names with appropriate paths. We recommend creating a separate work directory in which to run GIZA++. Next, we use the tokenizer script bundled with the Moses machine translation system to tokenize the text in each language (we'll cover Moses in another article):
cut -f3 tatoeba_es_en.tsv | mosesdecoder/scripts/tokenizer/tokenizer.perl -l es \
    > tatoeba_es_en.tsv.es
cut -f6 tatoeba_es_en.tsv | mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    > tatoeba_es_en.tsv.en
Then go to the giza-pp/GIZA++-v2 directory and run:
./plain2snt.out tatoeba_es_en.tsv.es tatoeba_es_en.tsv.en
which will generate vcb (vocabulary) files and snt (sentence) files, containing the list of vocabulary and aligned sentences, respectively.
IBM Models 4 and 5 use word classes to model distortion, i.e., how word order changes across languages, as in the 'white bird' / 'pájaro blanco' example above. mkcls is a program that automatically infers word classes from a corpus using a maximum likelihood criterion. It can be run as follows (in the giza-pp/mkcls-v2 directory):
./mkcls -ptatoeba_es_en.tsv.es -Vtatoeba_es_en.tsv.es.vcb.classes
./mkcls -ptatoeba_es_en.tsv.en -Vtatoeba_es_en.tsv.en.vcb.classes
See the paper by Franz Och for the details of this word clustering.
Finally, use the following command to run GIZA++:
./GIZA++ -S tatoeba_es_en.tsv.es.vcb -T tatoeba_es_en.tsv.en.vcb \
    -C tatoeba_es_en.tsv.es_tatoeba_es_en.tsv.en.snt -o [prefix] -outputpath [output]
Here, [prefix] and [output] are the prefix used for output files and the directory where output files are saved, respectively. This will generate a bunch of output files with cryptic names. Among them, probably the most important ones are [prefix].A3.final and [prefix].ti.final, which contain the actual Viterbi alignment and the lexical translation table, respectively.

Discussion

If you see ERROR: NO COOCURRENCE FILE GIVEN! when running GIZA++, you may need to change the Makefile and recompile GIZA++.
Before:
CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE
    -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
After:
CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE
(It is not clear why -DWORDINDEX_WITH_4_BYTE was duplicated.)
Also, depending on your environment, you may need to modify some of the source files as described in this article:
perl -pi -w -e 's/<tr1\//</g;' GIZA++-v2/* mkcls-v2/*
perl -pi -w -e 's/using namespace std::tr1;//g;' GIZA++-v2/* mkcls-v2/*
perl -pi -w -e 's/std::tr1:://g;' GIZA++-v2/* mkcls-v2/*
sed '36d' mkcls-v2/mystl.h > mkcls-v2/mystl.h.tmp
sed '50d' mkcls-v2/mystl.h.tmp > mkcls-v2/mystl.h
rm mkcls-v2/mystl.h.tmp
Finally, if you are using an operating system with a case-insensitive file system (e.g., Windows or OS X), you may need to modify Line 321 of model3.cpp as follows to prevent the .A3.final file from being overwritten by the .a3.final file:
-  alignfile = Prefix + ".A3." + number ;
+  alignfile = Prefix + ".VA3." + number ;   // "VA" for "Viterbi Alignment". Can be anything.
Here are some details of input/output file format for GIZA++:
  • vcb (vocabulary) file
    • This file contains a list of (uniq_id, string, number of occurrences) for each word.
  • snt (sentence alignment) file
    • This file contains groups of three lines per sentence pair: the number of times the pair occurred, the source sentence (with each token replaced by its uniq_id), and the target sentence in the same format.
  • T-tables (.ti.) file
    • This is the final inverse T-table (lexical translation probabilities) trained by the model. The lexical translation probability t(e|f) is the probability that word f in the source language is translated to word e in the target language. Since this is the inverse T-table, it contains t(f|e). The file with 'actual' in its filename contains actual word strings instead of unique IDs. This is an excerpt of the [prefix].actual.ti.final file trained from the Tatoeba corpus:
bird alimentador 3.07873e-06
bird apariencias 0.00452353
bird ave 0.0917504
bird aves 0.00720495
bird jaula 0.00635498
bird madruga 0.00524571
bird madrugador 0.0061525
bird pajarito 0.00875974
bird pájaro 0.814385
bird pájaros 0.053743
bird reluce 0.0018736
bird área 3.86079e-06
Since it contains t(f|e), you can confirm that summing over the source (in this case, Spanish) words gives a probability of 1.0 (see the parsing sketch after this list).
  • A (.A3.) file
    • This file contains the Viterbi alignment, which is the most probable alignment (the one that maximizes the alignment probability). One particular sentence pair in this file looks like:
# Sentence pair (8597) source length 8 target length 10 alignment score : 1.66432e-09
I saw a white bird on my way home.
NULL ({ 6 }) Vi ({ 1 2 }) un ({ 3 }) pájaro ({ 5 }) blanco ({ 4 }) camino ({ 7 8 }) a ({ })
    casa ({ 9 }) . ({ 10 })
The first line shows the length (number of words) of the source (Spanish) and target (English) sentences, along with the Viterbi alignment score mentioned above.
The second line is the target sentence, and the third line is the source sentence annotated with alignment information. Each source word is annotated with the set of indices of the target words aligned to it. Note that the IBM models assume each target word is aligned to at most one source word. The leading NULL ({ 6 }) means the 6th target word ('on') is not aligned to any source word, and Vi ({ 1 2 }) means the first and second target words, 'I saw', are aligned to 'Vi'.
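
To make these formats concrete, here is a minimal Python sketch, assuming each A3 record occupies exactly three lines and each line of the ti file holds one 'target source probability' triple:

import re
from collections import defaultdict

ALIGN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")  # matches tokens like 'Vi ({ 1 2 })'

def parse_a3(path):
    """Yield (score, target_tokens, [(source_word, [target_indices]), ...])."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    for i in range(0, len(lines) - 2, 3):  # each record is three lines
        header, target, aligned = lines[i:i + 3]
        score = float(header.rsplit(":", 1)[1])
        pairs = [(word, [int(j) for j in idx.split()])
                 for word, idx in ALIGN_RE.findall(aligned)]
        yield score, target.split(), pairs

def check_ti(path, tol=1e-3):
    """Return target words whose t(f|e) does not sum to ~1 over source words f."""
    totals = defaultdict(float)
    with open(path, encoding="utf-8") as f:
        for line in f:
            e, f_word, prob = line.split()
            totals[e] += float(prob)
    return {e: s for e, s in totals.items() if abs(s - 1.0) > tol}
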
Source: http://masatohagiwara.net

Sunday, October 21, 2018

Statistical natural language processing and corpus-based computational linguistics

* Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other
* Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
* SGML/XML
* Dictionaries
* Lexical/morphological resources
* Courses, Syllabi, and other Educational Resources
* Mailing lists
* Other stuff on the Web: General, IR, IE/Wrappers, People, Societies

Tools

Machine Translation systems

Instructions

* Building a baseline statistical phrase MT system
Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system: NAACL 2006 WMT or 2009 WMT.

Freely downloadable

* Moses
The most-used open-source phrase-based MT decoder. By Philipp Koehn and many others.
* Phrasal
A Java phrase-based MT decoder, largely compatible with the core of Moses, with extra functionality for defining feature-rich ML models. By Daniel Cer, Michel Galley, Spence Green, and others.
* Joshua
A Java hierarchical MT decoder, largely based on the design of Hiero. By Chris Callison-Burch and others.
* Jane
A phrase-based MT decoder by the RWTH Aachen group.
* cdec
A primarily SCFG-based MT decoder by Chris Dyer and many others. C++.
* EGYPT system
System from 1999 JHU workshop. Mainly of historical interest.
* GIZA++ and mkcls
Franz Och. C++. GPL. Still often used for word alignment.
* Thot
Phrase-based model building kit
* Phramer
An Open-Source Java Statistical Phrase-Based MT Decoder
* Syntax Augmented Machine Translation via Chart Parsing
Andreas Zollmann and Ashish Venugopal

Free, but getting them requires hassle

* Pharaoh decoder
Philipp Koehn, ISI.
* MTTK
Machine Translation Tool Kit. Deng and Byrne.

Part of Speech Taggers

Freely downloadable

* Stanford POS tagger
Loglinear tagger in Java (by Kristina Toutanova)
* hunpos
An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml, with pre-compiled models. Runs on Linux, Mac OS X, and Windows.
* MBT: Memory-based Tagger
Based on TiMBL
* TreeTagger
A decision-tree-based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) The page has links to sites where you can run it online.
* SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
* ACOPOST (formerly ICOPOST)
Open-source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU General Public License.
* MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger
Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
* fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
* mu-TBL
An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available. Prolog.
* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
* QTAG Part of speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
* The TOSCA/LOB tagger.
Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
* The venerable Brill's Transformation-based learning Tagger
A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.
* Original Xerox Tagger
A common lisp HMM tagger available by ftp.
* Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)

Free, but require registration

* TATOO
The ISSCO tagger. HMM tagger. Need to register to download.
* PoSTech Korean morphological analyzer and tagger
Online registration.
* TnT - A Statistical Part-of-Speech Tagger
Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.

Usable by email or on the web, but not distributed freely

* Memory-based tagger
From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
* Birmingham tagger
Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
* CLAWS tagger
The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
* The AMALGAM tagger
The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
* Xerox XRCE MLTT Part Of Speech Taggers
Tags any of 14 languages (European and Arabic), online on the web.
* Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.

Not free

* Lingsoft
Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
* Conexor
Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
* Xerox
Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
* Infogistics
Infogistics, an Edinburgh spinoff has a tagging and NP/Verb group chunker available commercially, including an evaluation version.

No longer available

* LT POS and LT TTT
The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.

NP chunking

Downloadable

* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
* Mark Greenwood's Noun Phrase Chunker
A Java reimplementation of Ramshaw and Marcus (1995).
* fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

Generic sequence models

Downloadable

* CRF++
Generic CRF-based model in C++. Open source. By the author of YamCha.
* Carafe
Generic CRF-based sequence models in O-CaML. Open source. By Ben Wellner.
* FreeLing
A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, and parsing.

Parsers

Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.

Semantic Parsers

Downloadable

* ASSERT
PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
* Shalmaneser
FrameNet-based by Katrin Erk.
* Tree Kernels in SVMlight by Alessandro Moschitti.
A general package, but it has particularly been used for SRL.

Named Entity Recognition

Downloadable

* Stanford Named Entity Recognizer
A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
* LingPipe
Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Coreference (Anaphora) Resolution

Downloadable

* Stanford Deterministic Coreference Resolution System
Winner of CoNLL 2011 shared task, with subsequent improvements. Distributed as part of Stanford CoreNLP. Heeyoung Lee and others. Java. GPL.
* Reconcile
By Ves Stoyanov and others. Java. GPL.
* Illinois Coreference Package
Java. University of Illinois Research and Academic Use License.
* Berkeley Coreference Resolution
Greg Durrett et al. Mainly Scala. GPL.
* BART
A Beautiful Anaphora Resolution Toolkit. Java. By Yannick Versley and many others. Java. Apache with GPL components.
* Guitar
Java. GPL.

Language modeling toolkits

Downloadable

* IRSTLM Toolkit
Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi, et al.
* CMU-Cambridge Statistical Language Modeling toolkit

Downloadable, but requires registration

* The SRI Language Modeling toolkit
By Andreas Stolcke. Another good system for building language models, freely available for research purposes.

Not yet classified

* Lextools
A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/

Friendly concordancing and text analysis tools

* Wordsmith Tools (Mike Scott)
The thing to get if you are working in the Windows world.

Text summarization tools

* A prototype Java Summarisation applet (System Quirk)
* MEAD
A public domain portable multi-document summarization system. (Dragomir Radev and others.)

Other

Downloadable

* Tilburg University's TiMBL
Tilburg's Memory Based Learner by Walter Daelemans et al. A general nearest-neighbour-based machine learning package, optimized for statistical NLP applications.
* splitta
Statistical sentence boundary detection by Dan Gillick.
* Time Expression taggers
TIMEX2 standard taggers (site at Mitre).
* NLTK
An open-source Python package for NLP application development, with tools for tokenization, POS tagging, and parsing, by Ed Loper and Steven Bird.
* Ted Pedersen's code
Ngram Statistics Package: Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
* ISIP tools
The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
* Mem. A Perl implementation of Generalized and Improved Iterative Scaling
by Hugo WL ter Doest.
* Automorphology
A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
* Wordnet
WordNet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an Italian/English/Spanish MultiWordNet, and there's now a site for Global WordNet. (See also mappings between WordNet versions, the Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)
* Penn XTAG project
A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
* Dan Melamed's Assorted Tools
A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools, and bitext geometry tools (mainly written in Perl 5).
* MULTEXT
Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis veronis@univ-aix.fr. Some stuff including a multilingual text editor is downloadable. MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.
* Naive Bayes algorithm
Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
* HDDI
Text Data Mining API from Lehigh University.
* Emdros: a text database engine for linguistic analysis and research
* Chasen
Japanese morphological analyzer. Descendent of JUMAN.

Free, but require registration

* Stuttgart's IMS Corpus Workbench (CWB)
A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
* Gate
University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
* MITRE's Alembic Workbench
A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
* SNoW
SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).

Unsure

* INTEX
A finite-state transducer analysis system for English, French, and Italian that runs under NeXTSTEP. Contact: Max Silberztein silberz@ladl.jussieu.fr
The PennTools page collects information on a variety of NLP systems, many of which are available externally.

Corpora

Large collections aimed at the NLP community

* LDC (Linguistic Data Consortium) and its catalogue by year.
Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
* European Language Resources Association and its catalogue.
Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
* ICAME (International Computer Archive of Modern English)
Sells various corpora (including Brown and London-Lund). Information on corpora is available on the web, by sending the message "help" to fileserv@nora.hd.uib.no, or by ftp to nora.hd.uib.no. Also, manuals for these corpora.
* Reuters @ NIST
Reuters corpora are now distributed by NIST.
* TRACTOR
TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
* CLR (Consortium for Lexical Research)
Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.
* OTA (Oxford Text Archive)
Provides mainly literary texts. Has a bright new web site. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
* Leipzig Corpora Collection
Sentence collections in MySQL database for 17 mainly European languages.
* BNC (British National Corpus)
A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.
* European Corpus Initiative Multilingual Corpus I (ECI/MCI)
A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. Need to sign a license agreement, available at the WWW site. Also available from the LDC.
* Survey of English Usage
At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
* International Corpus of English (ICE)
Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
* Corpora held by Lancaster University
This link provides its own annotations.
* The European Language Activity Network
Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
* Talkbank.
Rich video and transcripts.

Particular languages

English

English language corpora available from the sites above are not repeated here.
* Corpora by Geoffrey Sampson's team
The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
* Michigan Corpus of Academic Spoken English (MICASE). 1.7 million words from 1997-2001.
* Penn-Helsinki Parsed Corpus of Middle English
A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
* Corpus of Professional, Spoken American-English (CPSA)
2 million words from faculty and committee meetings and White House press conferences (50K-word sample free on the internet).
* Lancaster Parsed Corpus
* Dialogue Diversity Corpus (Bill Mann)
* American National Corpus

Chinese

* The Lancaster Corpus of Mandarin Chinese (LCMC)
By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.

Multilingual

* JRC-Acquis
A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.
* EMILLE/CIIL
Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.
* OPUS
An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
* World Health Organization Computer Assisted Translation page.
Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
* Searchable Canadian Hansard French-English parallel texts (1986-1993)
From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal
* European Union web server
Parallel text in all EU languages. (In particular try European legislation.)
* TELRI CD-ROMs
Parallel and other text in Central and Eastern European languages.

Bosnian

* The Oslo Corpus of Bosnian Texts.

Czech

* Parallel Czech-English
Literature translations in Czech and English
* Czech National Corpus project: SYN2000
100 million words of contemporary Czech.

French

* Association des Bibliophiles Universels
Various French literary works.
* American and French Research on the Treasury of the French Language (ARTFL)
150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).

German

* COSMAS Corpus
Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
* NEGRA Corpus
Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Free for academic use.

Russian

* Russian National Corpus
150 million words, 5 million words POS-tagged, some in dependency treebank.
* Library of Russian Internet Libraries
Various literary works.

Slovene

* Slovene-English parallel corpus
1 M words, free to download + on-line concordances.
* Coming soon: Slovene reference corpus of 100 M words

Croatian

* Croatian National Corpus
100 M words

Spanish and Portuguese

* TychoBrahe Parsed Corpus of Historical Portuguese
Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
* Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese corpora.
It's not clear what their availability is.
* The CUMBRE corpus. Contact Professor Aquilino Sánchez
* The CRATER Spanish corpus
Morphosyntactically tagged telecommunication manuals, available by ftp.
* Corpus resources for Portuguese
In total about 70 million words, available free, from various sources (newswire, etc.)
* Folha de S. Paulo newspaper
4 annual CDROMs with full text.
* COMPARA
Portuguese-English parallel corpus. (In general, there are various resources at the Linguateca site.)
* See also under ELRA, above.

Swedish

* Spraakdata, Department of Swedish, Göteborg University.
Has various searchable part-of-speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.

Treebanks

Name | Language | Size | Availability | Comments
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG)
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking.
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material.
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependency analyses are available (some free on the web, others via a license agreement) | An under-construction Bulgarian HPSG treebank.
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax.
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC) | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines.
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL | Built on a portion of the Parole corpus.
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus.
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures.
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch.
Icelandic Parsed Historical Corpus (IcePaHC) | Icelandic | 1,000,000 words | Free download (LGPL) | Texts from 1150 through 2008!
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts.
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand-corrected part is a free web download; automatically parsed part available through email contact | Text from the CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format.
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s.
* Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
* Syntactic Spanish Database (SDB) University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
* CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus. (There's also a 100 sentence Chinese treebank at U. Maryland.)
* LDC Korean Treebank.
* Dublin-Essex Treebank project
Deriving Linguistic Resources from Treebanks.

Discourse

CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.

Resources for Word Sense Disambiguation

* The Senseval web site
Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
* Ted Pedersen's code
Includes various WSD systems.
* SenseClusters
Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
* Evocation WordNet synset similarity judgments
Judgments on how similar the meanings of synsets are and how common they are in the BNC from Jordan Boyd-Graber.

Literature

There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:

Entirely or mainly English

* Alex: A Catalogue of Electronic Texts on the Internet
Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
* Wiretap Electronic Text Archive
Extensive and good quality. Still in the gopher age, though.
* The On-line Books Page
The index here only covers books in English, but there are lots of links to other collections of material in all languages.
* Project Gutenberg
The oldest and largest project to get out of copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)
* The Electronic Text Center of the University of Virginia
Large collection of SGML text, mainly in English, but also in other major languages.
* Center for Electronic Texts in the Humanities
Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
* Oxford Electronic Text Library Editions
Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
* Coreference annotated texts
From the University of Wolverhampton (R. Mitkov, C. Barbu et al.).

Acquisition data

* CHILDES database.
Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.

SGML/XML

* Robin Cover's SGML/XML Web Page
This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).
* Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)
* Xaira
XML Aware Indexing and Retrieval Application. The successor of SARA.
* Microsoft's XML page
* W3C XML page.
* The Corpus Encoding Standard.
An SGML instance designed for language engineering applications. Also the XML version.

Dictionaries

Dictionaries of subcategorization frames

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).
* COBUILD
Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
* LDOCE
Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
* OALD
Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different, incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.
Not exactly a dictionary, but other popular sources are:
* Levin (1993)
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.
* English subcategorization evaluation resources
Gold standard data, from Cambridge University (Anna Korhonen)
See also COMLEX and CELEX available from the LDC.

Dictionaries of assorted languages on the web

* The old version of Robert Beard's Web of Online Dictionaries long ago mutated into YourDictionary.com. I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.

Names

U.S. names, with frequency information, are available from the Census Bureau.

SGML structured dictionaries

* Cambridge International Dictionary of English and other products in SGML.

Lexical/morphological resources

* English SENSEVAL Resources
Dictionary entries and tagged examples for 35 words.
* ARIES Natural Language Tools
Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.

Courses, Syllabi, and other Educational Resources

"Techie"

* Foundations of Statistical Natural Language Processing
Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze's new textbook, published in June 1999 by MIT Press. Read about courses using this book.
* Corpus-based Linguistics
Christopher Manning's Fall 1994 CMU course syllabus (a postscript file).
* Statistical NLP: Theory and Practice
Christopher Manning's Spring 1996 CMU course materials.
* John Lafferty and Roni Rosenfeld's Spring 1997 CMU course Language and Statistics.
* Boston University (John D. Burger and Lynette Hirschman)
A good course and web site, by the looks!
* Draft of Data-Intensive Linguistics
By Chris Brew and Marc Moens.
* Statistical Natural Language Processing course
By Joakim Nivre. ELSNET supported.
* Short Course: Statistical Methods in NLP
By Philip Resnik
* Linguist's Guide to Statistics by Brigitte Krenn and Christer Samuelsson.
* Statistical and Corpora Based Methods for Processing Natural Languages
By Alon Itai, Technion Computer Science Department. (Don't read those old drafts of mine though ... get the real thing!)
* CS 241 Statistical Models in Natural-Language Processing
Eugene Charniak, Brown University.
* Michael Littman, Duke: 1997, 1998.

"Corpus Linguistics"

* A tutorial on concordances and corpora by Cathy Ball
* Tony Berber Sardinha's Corpus Linguistics course
Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)
* Concordancing and corpus linguistics
Notes prepared by Phil Benson, Hong Kong University.
* Computational Approaches to Collocations
Discussion of all the measures that have been used, and software for calculating them. By Evert and Krenn.

Mailing lists

Mailing lists that have information on these topics include:
* Corpora
The main mailing list for info on corpus-based linguistics. Subscribe by sending the message:
subscribe corpora
to listserv@uib.no. Or if you want to subscribe with a different email address, send:
subscribe corpora email-address
(Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!). Or you can subscribe on the web.
* Empiricist
The empiricist list appears to be defunct now. You used to send a "subscribe" message to empiricists-request@unagi.cis.upenn.edu.

Other stuff on the Web

General resources

* NIST Human Language Technology programs
Including: TREC, TIDES, ACE, ....
* Text summarization
Tons of resources (tutorials, bibliographies, and software) for document summarization, maintained by Dragomir Radev.
* PropositionBank @ UPenn
* Statistical MT
* Bookmarks for Corpus-based Linguists An extensive annotated collection by David Lee, aimed at linguistics more than NLP (includes web-searchable corpora and concordancing options).
* HLTCentral
European site aiming to increase transfer of language technologies to the commercial market. News, etc.
* Linguistic annotation
A description of formats for linguistic annotation by Steven Bird.
* CTI Textual Studies, University of Oxford, Guide to Digital Resources
Lists text analysis tools, corpora, and other stuff.
* U. Essex W3-Corpora
Lots of teaching material, links, and online corpora.
* Computational Linguistics and NLP (Kenji Kita, Tokushima U.)
A good, well-organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.
* HLT Central
European Human Language Technology site
* Survey of the State of the Art in Human Language Technology
* ACL SIGLEX list of Lexical Resources
* Online materials for a course on Learning Dynamical Systems at Brown University.
Lots of neat info.
* Expert Advisory Group for Language Engineering Standards (EAGLES) home page
European standards organization.
* Materials prepared for Michael Barlow's Corpus Linguistics course
* Corpus Linguistics University of Birmingham
* Chris Brew's Teaching Materials for statistical NLP
Not much there last time I looked; you might also try his home page.
* Edinburgh LTG HelpDesk's FAQ
Many of the questions in the FAQ concern issues related to corpora and tagging.
* Content Analysis Resources
Qualitative Text Analysis, Concordances, etc.
* MT paper archive
Lots of papers, etc.

Information Retrieval

* The SMART IR system
* ACM SIGIR
* Managing Gigabytes
* TREC conference
* Text-based Intelligent Systems (Bruce Croft)

Information Extraction/Wrapper Induction

* Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.
* IE data sets
Updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.
* Web -> KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.
* RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears to not have been updated since 1999.
* Message Understanding Conference (MUC) information. A US government funded information extraction exercise (from the 1990s).
* Web IR and IE (Einat Amitay). Various links on IR and IE on the web.
* Web question answering system (University of Michigan)
* GATE: General Architecture for Text Engineering (Sheffield)
* Genia Project. Biomedical text information extraction corpus (Tsujii lab). And IE tutorial slides.

People's homepages

Home pages with something useful on them.
* University of Texas at Austin Machine Learning Research Group
* Steven Abney (until 1997)
* Adam Berger
Various stuff on statistical MT and maximum entropy models
* Alex Chengyu Fang
Provides a lot of info on the kinds of things they get up to at UCL, without actually giving you anything to play with yourself.

Societies/Journals

* International Quantitative Linguistics Association/Journal of Quantitative Linguistics
Not very hip.
* Association for Computational Linguistics/Computational Linguistics
Hipper 
Source: https://nlp.stanford.edu