Thursday, November 1, 2018

Toggle Keyboard Layout in Windows with AutoHotkey

Run the AHK script, then press Super+1 (the Windows key plus 1) to switch keyboard layouts.

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.
; This should be replaced by whatever your native language is. See
; for the language identifiers list.
de := DllCall("LoadKeyboardLayout", "Str", "00010407", "Int", 1)
en := DllCall("LoadKeyboardLayout", "Str", "00010409", "Int", 1)
ro := DllCall("LoadKeyboardLayout", "Str", "00010418", "Int", 1)

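#1::  ; Super+1 (Win+1) hotkey, as described above
w := DllCall("GetForegroundWindow")
; Despite the variable name, GetWindowThreadProcessId returns the window's thread ID.
pid := DllCall("GetWindowThreadProcessId", "UInt", w, "Ptr", 0)
l := DllCall("GetKeyboardLayout", "UInt", pid)
; 0x50 is WM_INPUTLANGCHANGEREQUEST. Cycle: en -> ro, de -> en, anything else -> de.
if (l = en)
    PostMessage 0x50, 0, %ro%,, A
else if (l = de)
    PostMessage 0x50, 0, %en%,, A
else
    PostMessage 0x50, 0, %de%,, A
return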
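The switching logic in the script is a three-way cycle. As a rough illustration only (this is a plain Python model, not a replacement for the AHK script), the cycle can be written as a lookup table. The helper name next_layout is made up for this sketch; the layout identifiers are the ones passed to LoadKeyboardLayout above.

```python
# Keyboard layout identifiers copied from the LoadKeyboardLayout calls above.
DE, EN, RO = "00010407", "00010409", "00010418"

# Hypothetical lookup table: active layout -> layout the hotkey selects next.
NEXT_LAYOUT = {EN: RO, DE: EN}


def next_layout(current: str) -> str:
    """Return the layout to switch to; anything other than EN or DE falls back to DE."""
    return NEXT_LAYOUT.get(current, DE)


layout = EN
for _ in range(3):
    following = next_layout(layout)
    print(layout, "->", following)
    layout = following
```

Pressing the hotkey three times starting from English therefore visits Romanian, German, and English again.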

You-Get - Cli Downloader

Most of you may have used or heard of Youtube-dl, a command line program to download videos from YouTube and 100+ other websites. I just stumbled upon a similar utility named “You-Get”. It is also a CLI downloader written in Python. It allows you to download images, audio and videos from popular websites like YouTube, Facebook, Twitter and a lot more. Currently, 80+ websites are supported. See the project's documentation for the full list of supported sites.
You-Get is not only a downloader; it can also stream online videos in your media player. It even lets you search for videos on Google: just pass a search term and You-Get will google it and download the most relevant video. Another notable feature is that it lets you pause and resume downloads. It is a completely free, open source, cross-platform application that runs on Linux, Mac OS and Windows.

Install You-Get

Make sure you have installed the following prerequisites.
You-Get can be installed in many ways. The officially recommended method is to use the pip package manager. If you haven’t installed pip yet, refer to the following link.
Please note that you must install the Python 3 version of pip.
Now, run the following command to install You-Get:
$ pip3 install you-get
You can upgrade You-Get to its latest version using the command:
$ pip3 install --upgrade you-get

Getting Started With You-Get

The usage is pretty much the same as for the Youtube-dl utility.
Download Videos
To download a video, just run:
$ you-get
Sample output:
site: YouTube
title: The Last of The Mohicans by Alexandro Querevalú
 - itag: 22
 container: mp4
 quality: hd720
 size: 56.9 MiB (59654303 bytes)
 # download-with: you-get --itag=22 [URL]

Downloading The Last of The Mohicans by Alexandro Querevalú.mp4 ...
 100% ( 56.9/ 56.9MB) ├███████████████████████████████████████████████████████┤[1/1] 752 kB/s
You may want to view the details of a video before downloading it. You-Get can do that with the “--info” or “-i” flag, which lists all available qualities and formats of the given video.
$ you-get -i
$ you-get --info
Sample output would be:
site: YouTube
title: The Last of The Mohicans by Alexandro Querevalú
streams: # Available quality and codecs
 [ DASH ] ____________________________________
 - itag: 137
 container: mp4
 quality: 1920x1080
 size: 101.9 MiB (106816582 bytes)
 # download-with: you-get --itag=137 [URL]

- itag: 248
 container: webm
 quality: 1920x1080
 size: 90.3 MiB (94640185 bytes)
 # download-with: you-get --itag=248 [URL]

- itag: 136
 container: mp4
 quality: 1280x720
 size: 56.9 MiB (59672392 bytes)
 # download-with: you-get --itag=136 [URL]

- itag: 247
 container: webm
 quality: 1280x720
 size: 52.6 MiB (55170859 bytes)
 # download-with: you-get --itag=247 [URL]

- itag: 135
 container: mp4
 quality: 854x480
 size: 32.2 MiB (33757856 bytes)
 # download-with: you-get --itag=135 [URL]

- itag: 244
 container: webm
 quality: 854x480
 size: 28.0 MiB (29369484 bytes)
 # download-with: you-get --itag=244 [URL]

[ DEFAULT ] _________________________________
 - itag: 22
 container: mp4
 quality: hd720
 size: 56.9 MiB (59654303 bytes)
 # download-with: you-get --itag=22 [URL]
By default, You-Get will download the format marked with DEFAULT. If you don’t like that format or quality, you can pick any other format you like. Use the itag value given for each format.
$ you-get --itag=244
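If you want to pick the itag programmatically, the listing can be parsed. The following is a minimal sketch that assumes the output is plain text laid out like the sample above (real terminal output may contain colour codes, which this does not handle). The names parse_streams and smallest_stream are made up for illustration and are not part of You-Get.

```python
import re

# A shortened copy of the sample listing shown above.
SAMPLE = """\
 - itag: 137
 container: mp4
 quality: 1920x1080
 size: 101.9 MiB (106816582 bytes)
 # download-with: you-get --itag=137 [URL]

 - itag: 244
 container: webm
 quality: 854x480
 size: 28.0 MiB (29369484 bytes)
 # download-with: you-get --itag=244 [URL]

 - itag: 22
 container: mp4
 quality: hd720
 size: 56.9 MiB (59654303 bytes)
 # download-with: you-get --itag=22 [URL]
"""

# One match per stream entry: itag, container, quality and size in bytes.
STREAM_RE = re.compile(
    r"-\s*itag:\s*(?P<itag>\d+)\s+"
    r"container:\s*(?P<container>\S+)\s+"
    r"quality:\s*(?P<quality>\S+)\s+"
    r"size:.*?\((?P<bytes>\d+) bytes\)"
)


def parse_streams(text):
    """Turn the listing into a list of dicts, one per stream."""
    return [
        {
            "itag": int(m["itag"]),
            "container": m["container"],
            "quality": m["quality"],
            "bytes": int(m["bytes"]),
        }
        for m in STREAM_RE.finditer(text)
    ]


def smallest_stream(streams, container=None):
    """Return the smallest stream, optionally restricted to one container."""
    candidates = [s for s in streams if container in (None, s["container"])]
    return min(candidates, key=lambda s: s["bytes"])


streams = parse_streams(SAMPLE)
print(smallest_stream(streams)["itag"])
print(smallest_stream(streams, container="mp4")["itag"])
```

The chosen itag can then be passed to you-get with --itag as shown above.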
Download Audios
The following command will download an audio track from the SoundCloud website.
$ you-get ''
Type: MP3 (audio/mpeg)
Size: 2.58 MiB (2710046 Bytes)

 100% ( 2.6/ 2.6MB) ├███████████████████████████████████████████████████████┤[1/1] 983 kB/s
To view the details of the audio file, use the -i flag.
$ you-get -i ''
Download Images
To download an image, run:
$ you-get
You-Get can also download all images from a web page.
$ you-get
Search Videos
You-Get doesn’t even need a valid URL. You can just pass a search term, and You-Get will google it and download the most relevant video based on your search string.
$ you-get 'Micheal Jackson'
Google Videos search:
Best matched result:
site: YouTube
title: Michael Jackson - Beat It (Official Video)
 - itag: 43
 container: webm
 quality: medium
 size: 29.4 MiB (30792050 bytes)
 # download-with: you-get --itag=43 [URL]

Downloading Michael Jackson - Beat It (Official Video).webm ...
 100% ( 29.4/ 29.4MB) ├███████████████████████████████████████████████████████┤[1/1] 2 MB/s
Watch Videos
You-Get can stream online videos in your media player or browser, without ads or a comment section.
To watch videos in a media player, for example VLC, run the following command:
$ you-get -p vlc
$ you-get --player vlc
Similarly, to stream the videos in your browser, for example chromium, use:
$ you-get -p chromium

There are no ads and no comment section, just a plain page with the video.
Set path and file name for downloaded videos
By default, videos are downloaded to the current working directory under their default titles. You can change both: use the --output-dir/-o flag to set the path and --output-filename/-O to set the name of the downloaded file.
$ you-get -o ~/Videos -O output.mp4
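When calling you-get from a script, it helps to assemble the argument list in one place. This is only a sketch: build_command is a made-up helper, the URL is a placeholder, and only the flags mentioned in this post (--itag, -o, -O) are used.

```python
def build_command(url, itag=None, output_dir=None, output_filename=None):
    """Assemble a you-get argument list from the options covered in this post."""
    cmd = ["you-get"]
    if itag is not None:
        cmd.append(f"--itag={itag}")
    if output_dir:
        cmd += ["-o", output_dir]
    if output_filename:
        cmd += ["-O", output_filename]
    cmd.append(url)
    return cmd


# Placeholder URL, for illustration only.
cmd = build_command(
    "https://example.com/video",
    itag=22,
    output_dir="Videos",
    output_filename="output.mp4",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with titles or paths that contain spaces.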
Pause and resume downloads
Press CTRL+C to cancel a download. A temporary .download file will be saved in the output directory. Next time you run you-get with the same arguments, the download process will resume from the last session.
If the file has been completely downloaded, the temporary .download extension will be gone, and you-get will just skip the download. To force re-downloading, use the --force/-f option.
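The behaviour described above can be predicted before running you-get again. The sketch below assumes the temporary file is simply the final file name with ".download" appended. That naming is my reading of the description, not something checked against the You-Get source, and download_state is a made-up helper.

```python
from pathlib import Path


def download_state(output_dir, filename):
    """Guess what you-get will do for this file: 'skip', 'resume' or 'start'."""
    target = Path(output_dir) / filename
    # Assumption: the partial file is named "<filename>.download".
    partial = target.with_name(target.name + ".download")
    if target.exists():
        return "skip"      # finished earlier; only --force/-f re-downloads
    if partial.exists():
        return "resume"    # interrupted earlier; continues from the last session
    return "start"         # nothing on disk yet


print(download_state(".", "output.mp4"))
```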
For more details, refer to the help section by running the following command.
$ you-get --help

Activate AHCI in Windows

Check the SATA mode in the BIOS. On my machine, AHCI mode was disabled and the controller was running in RAID mode. Windows 7 performs much better with AHCI enabled, but simply switching the mode did not work, because the AHCI driver had not been loaded at installation time: the system would not recognize the HDD and I got the error 'STOP 0x0000007B INACCESSIBLE_BOOT_DEVICE'. The cause is that Windows 7 installs the AHCI driver only if AHCI mode is enabled at the time of installation. The fix is to boot in RAID mode and manually enable the AHCI driver in the Windows Registry. The process is as follows:
Enable the AHCI driver in the registry before you change the SATA mode of the boot drive.
1. Exit all Windows-based programs.
2. Click Start, type regedit in the Start Search box, and then press ENTER.
3. If you receive the User Account Control dialog box, click Continue.
4. Locate and then click one of the following registry subkeys:
5. In the right pane, right-click Start in the Name column, and then click Modify.
6. In the Value data box, type 0, and then click OK.
7. On the File menu, click Exit to close Registry Editor.
The Microsoft KB link for this is here:
Then restart and go into the BIOS. Change the mode to AHCI; this time the boot was successful.

Error message occurs after you change the SATA mode of the boot drive

Applies to: Windows Vista Business, Windows Vista Enterprise, Windows Vista Home Basic


After you use the BIOS setup of a Windows 7-based computer or a Windows Vista-based computer to change the Serial Advanced Technology Attachment (SATA) mode of the boot drive to use either the Advanced Host Controller Interface (AHCI) specification or redundant array of independent disks (RAID) features, you receive the following error message when the computer is restarted:


This issue occurs if the disk driver in Windows 7 or Windows Vista is disabled. This driver must be enabled before you change the SATA/RAID mode of the boot drive.

This section, method, or task contains steps that tell you how to modify the registry. However, serious problems might occur if you modify the registry incorrectly. Therefore, make sure that you follow these steps carefully. For added protection, back up the registry before you modify it. Then, you can restore the registry if a problem occurs. For more information about how to back up and restore the registry, click the following article number to view the article in the Microsoft Knowledge Base:
322756 How to back up and restore the registry in Windows
To resolve this issue, enable the AHCI driver in the registry before you change the SATA mode of the boot drive. To do this, follow these steps: 
  1. Exit all Windows-based programs.
  2. Click Start, type regedit in the Start Search box, and then press ENTER.
  3. If you receive the User Account Control dialog box, click Continue.
  4. Locate and then click one of the following registry subkeys:
  5. In the pane on the right side, right-click Start in the Name column, and then click Modify.
  6. In the Value data box, type 0, and then click OK.
  7. On the File menu, click Exit to close Registry Editor.

During the Windows 7 or Windows Vista installation process, any unused storage drivers are disabled. This behavior speeds up the operating system's startup process. When you change the boot drive to a driver that is disabled, you must enable the new driver before you change the hardware configuration.

For example, assume that you install Windows Vista or Windows 7 on a computer that contains a controller that uses the Pciide.sys driver. Later, you change the SATA mode to AHCI. Therefore, the drive must now load the Msahci.sys driver. However, you must enable the Msahci.sys driver before you make this change.
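The registry change can also be prepared as a .reg file instead of being clicked through in Registry Editor. The sketch below only builds the file text. It assumes the service key is named msahci, inferred from the Msahci.sys driver named above, because the list of subkeys did not survive in this copy of the article. Check the key on your own system and back up the registry before importing anything. build_reg_file is a made-up helper.

```python
# Assumption: the AHCI driver's service key is "msahci" (inferred from Msahci.sys).
SERVICES_ROOT = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"


def build_reg_file(service="msahci", start_value=0):
    """Return .reg file text that sets the driver's Start value (0 = load at boot)."""
    if not 0 <= start_value <= 4:
        raise ValueError("Start must be between 0 and 4")
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{SERVICES_ROOT}\\{service}]\r\n"
        f'"Start"=dword:{start_value:08x}\r\n'
    )


print(build_reg_file())
```

The printed text would be saved to a file with a .reg extension and imported while Windows is still booting in the old SATA mode.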

This issue affects only the boot drive. If the drive that you change is not the boot drive, you do not experience this issue.

AHCI provides several features for SATA devices. These include hot plug functionality and power management functionality. For more information about the AHCI specification, go to the following Intel website:

Microsoft provides third-party contact information to help you find technical support. This contact information may change without notice. Microsoft does not guarantee the accuracy of this third-party contact information.

The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, regarding the performance or reliability of these products.

Sunday, October 21, 2018

Deep Learning Frameworks for Natural Language Processing in Python


Chainer, developed by the Japanese company Preferred Networks (founded in 2014), is a powerful, flexible, and intuitive Python-based framework for neural networks that adopts a “define-by-run” scheme [1]. It stores the history of computation instead of programming logic. Chainer supports CUDA computation and multi-GPU setups. The framework is released under the MIT License and has already been applied to sentiment analysis, machine translation, speech recognition, question answering, and so on, using different types of neural networks such as convolutional networks, recurrent networks, and sequence-to-sequence models [2].


Deeplearning4j is a deep learning library for Java, but it also has a Python API, Keras, which is described below. Distributed CPUs and GPUs, parallel training via iterative reduce, and micro-service architecture adaptation are its main features [3]. Vector space modeling enables the tool to solve text-mining problems. Part-of-speech (PoS) tagging, dependency parsing, and word2vec for creating word embeddings are discussed in the documentation.


Deepnl is another neural network Python library especially created for natural language processing by Giuseppe Attardi. It provides tools for part-of-speech tagging, named entity recognition, semantic role labeling (using convolutional neural networks [4]), and word embedding creation [5].


Dynet is a tool developed by Carnegie Mellon University and many others. It supports C++ and Python and runs on either CPU or GPU [6]. Dynet is based on the dynamic declaration of network structure [7]. This tool was used for creating outstanding systems for NLP problems including syntactic parsing, machine translation, morphological inflection, and many others.


Keras is a high-level neural network API for Python that runs on CPU or GPU. It supports convolutional and recurrent networks and may run on top of TensorFlow, CNTK, or Theano. Its main focus is to enable fast experimentation [8]. There are many examples of Keras usage in the comparative table: classification, text generation and summarization, tagging, parsing, machine translation, speech recognition, and others.


Erick Rocha Fonseca’s nlpnet is also a Python library for NLP tasks based on neural networks. Convolutional networks enable users to perform part-of-speech tagging, semantic role labeling, and dependency parsing [9]. Most of the architecture is language independent [10].


OpenNMT is a Python machine translation tool that is released under the MIT license and relies on the PyTorch library. The system demonstrates efficiency and state-of-the-art translation accuracy and is used by many translation providers [11]. It also incorporates text summarization, speech recognition, and image-to-text conversion blocks [12].


PyTorch is a fast and flexible neural network framework with an imperative paradigm. It builds neural networks on a tape-based autograd system and provides tensor computation with strong GPU acceleration [13]. Recurrent neural networks are mostly used in PyTorch for machine translation, classification, text generation, tagging, and other NLP tasks.
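To make “tape-based autograd” a little more concrete, here is a toy version in plain Python. It is not PyTorch code and does not use the PyTorch API; the Var class and gradients function are made up for this sketch. Each operation appends a small backward step to a shared tape as it runs, and the gradients are obtained by replaying the tape in reverse.

```python
class Var:
    """A scalar value that records every operation on a shared tape."""

    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)

        def backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad

        self.tape.append(backward)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)

        def backward():
            # d(out)/d(self) = other and d(out)/d(other) = self
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        self.tape.append(backward)
        return out


def gradients(output):
    """Replay the tape backwards, starting from d(output)/d(output) = 1."""
    output.grad = 1.0
    for backward in reversed(output.tape):
        backward()


tape = []
x, y = Var(3.0, tape), Var(4.0, tape)
z = x * y + x          # the tape now holds the multiply, then the add
gradients(z)
print(z.value, x.grad, y.grad)   # dz/dx = y + 1, dz/dy = x
```

Because the tape is built while the Python code executes, ordinary control flow such as loops and conditionals changes the recorded graph from one run to the next. That is the practical meaning of the imperative style mentioned above.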


Its developers call spaCy the fastest system in the world. They also affirm that their tool is the best way to prepare text for deep learning. spaCy works well with well-known Python libraries like gensim, Keras, TensorFlow, and scikit-learn. Matthew Honnibal, the author of the library, says that spaCy’s mission is to make cutting-edge NLP practical and commonly available [14]. Text classification, named entity recognition, part-of-speech tagging, dependency parsing, and other examples are presented in the comparative table.

Stanford’s CoreNLP

Stanford’s CoreNLP is a flexible, fast, and modern grammatical analysis tool that provides APIs for most common programming languages including Python. It also has an ability to run as a simple web service. As mentioned on the official website, the framework has a part-of-speech (POS) tagger, named entity recognizer (NER), parser, coreference resolution system, sentiment analysis, bootstrapped pattern learning, and open information extraction tools [15].


The Google Brain Team developed TensorFlow and released it in 2015 for research purposes. Now many companies, such as Airbus, Intel, IBM, and Twitter, use TensorFlow at production scale. The system architecture is flexible, so computations can run on CPUs or GPUs. The main concept is the use of data flow graphs: nodes of the graph represent mathematical operations, while the edges represent multidimensional data arrays (tensors) communicated between them [16]. One of the best-known NLP applications of TensorFlow is Google Translate. Other applications include text classification and summarization, speech recognition, tagging, and so on.
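The data flow graph idea can be illustrated without TensorFlow itself. In the sketch below (plain Python with scalars standing in for tensors, not the TensorFlow API), each node names an operation and its inputs, and evaluating a node pulls values along the edges. The graph layout and the evaluate function are made up for illustration.

```python
import operator

# Each node: name -> (operation, names of its inputs). The input names are the edges.
graph = {
    "sum": (operator.add, ("a", "b")),
    "prod": (operator.mul, ("sum", "c")),
}


def evaluate(node, graph, feeds):
    """Compute a node by first computing everything that flows into it."""
    if node in feeds:                      # a value supplied from outside the graph
        return feeds[node]
    op, inputs = graph[node]
    return op(*(evaluate(name, graph, feeds) for name in inputs))


result = evaluate("prod", graph, {"a": 2, "b": 3, "c": 4})   # (2 + 3) * 4
print(result)
```

Describing the computation as data first, and running it afterwards, is what lets a framework decide where each operation should execute.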


As TensorFlow is a low-level API, many high-level APIs were created to run on top of it to make the user experience faster and more understandable. TFLearn is one of these tools, and it runs on CPU and GPU. It has a special graph visualization tool with details about weights, gradients, activations, and so on [17]. The library is already used for sentiment analysis, text generation, and named entity recognition. It lets users work with convolutional neural networks and recurrent neural networks (LSTM).


Theano is a numerical computation Python library that enables users to create their own machine learning models [18]. Many frameworks like Keras are built on top of Theano. There are tools for machine translation, speech recognition, word embedding, and text classification. Look at Theano’s applications in the table.


In this paper, we described Python tools for natural language processing that support neural networks. These tools are Chainer, Deeplearning4j, Deepnl, Dynet, Keras, Nlpnet, OpenNMT, PyTorch, SpaCy, Stanford’s CoreNLP, TensorFlow, TFLearn, and Theano. A table lets readers easily compare the frameworks discussed above.



10 Open-Source Tools/Frameworks for Artificial Intelligence


An open-source software library for Machine Intelligence.

TensorFlow™ is an open-source software library, which was originally developed by researchers and engineers working on the Google Brain Team. TensorFlow is for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
TensorFlow provides multiple APIs. The lowest level API — TensorFlow Core — provides you with complete programming control. The higher level APIs are built on top of TensorFlow Core. These higher level APIs are typically easier to learn and use than TensorFlow Core. In addition, the higher level APIs make repetitive tasks easier and more consistent between different users. A high-level API like tf.estimator helps you manage data sets, estimators, training, and inference.
The central unit of data in TensorFlow is the tensor. A tensor consists of a set of primitive values shaped into an array of any number of dimensions. A tensor's rank is its number of dimensions.
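As a small illustration of rank and shape, nested Python lists can stand in for tensors. This is not TensorFlow code. The shape and rank helpers are made up for this sketch and assume regular, non-empty nesting.

```python
def shape(tensor):
    """Return the size of each dimension of a regular, non-empty nested list."""
    dims = []
    while isinstance(tensor, (list, tuple)):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def rank(tensor):
    """A tensor's rank is its number of dimensions."""
    return len(shape(tensor))


print(rank(3.0))                        # a scalar
print(rank([1.0, 2.0, 3.0]))            # a vector
print(shape([[1, 2, 3], [4, 5, 6]]))    # a matrix with 2 rows and 3 columns
```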
A few Google applications using TensorFlow are:
RankBrain: A large-scale deployment of deep neural nets for search ranking on
Inception Image Classification Model: Baseline model and follow-on research into highly accurate computer vision models, starting with the model that won the 2014 Imagenet image classification challenge
SmartReply: Deep LSTM model to automatically generate email responses
Massively Multitask Networks for Drug Discovery: A deep neural network model for identifying promising drug candidates by Google in association with Stanford University.
On-Device Computer Vision for OCR: On-device computer vision model to do optical character recognition to enable real-time translation

Useful Links

Tensorflow home
Getting started

Apache SystemML

An optimal workplace for machine learning using big data.

SystemML, the machine-learning technology created at IBM, has reached top-level project status at the Apache Software Foundation. It is a flexible, scalable machine learning system. Important characteristics are:
Algorithm customizability via R-like and Python-like languages.
Multiple execution modes, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC (Java Machine Learning Connector).
Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
SystemML is considered the SQL for machine learning. The latest version (1.0.0) of SystemML supports: Java 8+, Scala 2.11+, Python 2.7/3.5+, Hadoop 2.6+, and Spark 2.1+.
It can be run on top of Apache Spark, where it automatically scales your data line by line, determining whether your code should be run on the driver or an Apache Spark cluster. Future SystemML developments include additional deep learning with GPU capabilities such as importing and running neural network architectures and pre-trained models for training.
Java Machine Learning Connector (JMLC) for SystemML
The Java Machine Learning Connector (JMLC) API is a programmatic interface for interacting with SystemML in an embedded fashion. The primary purpose of JMLC is as a scoring API, where your scoring function is expressed using SystemML’s DML (Declarative Machine Learning) language. In addition to scoring, embedded SystemML can be used for tasks such as unsupervised learning (for example, clustering) in the context of a larger application running on a single machine.

Useful Links

SystemML home


A deep learning framework made with expression, speed, and modularity in mind.

The Caffe project was initiated by Yangqing Jia during his Ph.D. at UC Berkeley and then later developed by Berkeley AI Research (BAIR) and by community contributors. It mostly focusses on convolutional networks for computer vision applications. Caffe is a solid and popular choice for computer vision-related tasks and you can download many successful models made by Caffe users from the Caffe Model Zoo (link below) for out-of-the-box use.

Caffe Advantages

Expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine then deploy to commodity clusters or mobile devices.
Extensible code fosters active development. In Caffe’s first year, it was forked by over 1,000 developers and had many significant changes contributed back.
Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU.
Community: Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia.

Useful Links

Caffe home
Caffe user group
Tutorial presentation of the framework and a full-day crash course
Caffe Model Zoo

Apache Mahout

A distributed linear algebra framework and mathematically expressive Scala DSL

Mahout was designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed backend, and Mahout can be extended to other distributed backends.
Mathematically Expressive Scala DSL
Support for Multiple Distributed Backends (including Apache Spark)
Modular Native Solvers for CPU/GPU/CUDA Acceleration
Apache Mahout currently implements areas including Collaborative filtering (CF), Clustering and Categorization


Taste CF. Taste is an open-source project for CF (collaborative filtering) started by Sean Owen on SourceForge and donated to Mahout in 2008.
Several Map-Reduce enabled clustering implementations including k-Means, fuzzy k-Means, Canopy, Dirichlet, and Mean-Shift.
Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
Distributed fitness function capabilities for evolutionary programming.
Matrix and vector libraries.
Examples of all of the above algorithms.
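To show what the k-Means entry in the list refers to, here is a tiny single-machine version on one-dimensional data. It is plain Python and not Mahout's distributed implementation. The k_means function and the starting centroids are chosen only for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

The Map-Reduce variants mentioned above split the same two steps across machines: the assignment step maps over points, and the averaging step reduces per cluster.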

Useful Links

Mahout home
Intro to Mahout by Grant Ingersoll


An open-source class library written in C++, which implements neural networks.
OpenNN (Open Neural Networks Library), formerly known as Flood, is based on the Ph.D. thesis of R. Lopez, "Neural Networks for Variational Problems in Engineering," at Technical University of Catalonia, 2008.

OpenNN implements data mining methods as a bundle of functions. These can be embedded in other software tools using an application programming interface (API) for the interaction between the software tool and the predictive analytics tasks. The main advantage of OpenNN is its high performance. It is developed in C++ for better memory management and higher processing speed and implements CPU parallelization by means of OpenMP and GPU acceleration with CUDA.
The package comes with unit testing, many examples, and extensive documentation. It provides an effective framework for the research and development of neural networks algorithms and applications. Neural Designer is a professional predictive analytics tool that uses OpenNN, which means that the neural engine of Neural Designer has been built using OpenNN.
OpenNN has been designed to learn from both datasets and mathematical models.


Function regression.
Pattern recognition.
Time series prediction.

Mathematical Models

Optimal control.
Optimal shape design.

Datasets and Mathematical Models

Inverse problems.

Useful Links 

OpenNN home 
OpenNN Artelnics GitHub
Neural Designer


An open-source machine learning library, a scientific computing framework, and a script language based on the Lua programming language.

  • a powerful N-dimensional array
  • lots of routines for indexing, slicing, transposing, …
  • amazing interface to C, via LuaJIT
  • linear algebra routines
  • neural network, and energy-based models
  • numeric optimization routines
  • Fast and efficient GPU support
  • Embeddable, with ports to iOS and Android backends
Torch is used by the Facebook AI Research Group, IBM, Yandex, and the Idiap Research Institute. It has been extended for use on Android and iOS and has been used to build hardware implementations for data flows like those found in neural networks.
Facebook has released a set of extension modules as open source software.
PyTorch is an open-source machine learning library for Python, used for applications such as natural language processing. It is primarily developed by Facebook's artificial intelligence research group, and Uber's "Pyro" software for probabilistic programming is built upon it.

Useful Links

Torch Home


An object-oriented neural network framework written in Java.

Neuroph can be used to create and train neural networks in Java programs. Neuroph provides a Java class library as well as a GUI tool, easyNeurons, for creating and training neural networks. Neuroph is a lightweight Java neural network framework for developing common neural network architectures. It contains a well designed, open-source Java library with a small number of basic classes that correspond to basic NN concepts. It also has a nice GUI neural network editor to quickly create Java neural network components. It has been released as open source under the Apache 2.0 license.
Neuroph's core classes correspond to basic neural network concepts like artificial neuron, neuron layer, neuron connections, weight, transfer function, input function, learning rule, etc. Neuroph supports common neural network architectures such as Multilayer perceptron with Backpropagation, Kohonen and Hopfield networks. All these classes can be extended and customized to create custom neural networks and learning rules. Neuroph has built-in support for image recognition.
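The concepts listed above (weights, input function, transfer function, learning rule) fit in a few lines. The sketch below is plain Python and not Neuroph's Java API. It trains a single artificial neuron on the logical AND function with the classic perceptron learning rule, and all names in it are made up for illustration.

```python
def predict(weights, bias, inputs):
    """Input function: weighted sum. Transfer function: a step at zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Learning rule: nudge each weight by (target - output) * input."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, "->", predict(weights, bias, inputs))
```

A single neuron like this can only separate classes with a straight line, which is why the multilayer perceptron with backpropagation appears among the supported architectures.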

Useful Links

Neuroph Home


The first commercial-grade, open-source, distributed deep-learning library written for Java and Scala.

Deeplearning4j aims to be cutting-edge plug and play and more convention than configuration, which allows for fast prototyping for non-researchers.
DL4J is customizable at scale.
DL4J can import neural net models from most major frameworks via Keras, including TensorFlow, Caffe and Theano, bridging the gap between the Python ecosystem and the JVM with a cross-team toolkit for data scientists, data engineers and DevOps. Keras is employed as Deeplearning4j's Python API.
Machine learning models are served in production with Skymind's model server.


  • Distributed CPUs and GPUs
  • Java, Scala and Python APIs
  • Adapted for micro-service architecture
  • Parallel training via iterative reduce
  • Scalable on Hadoop
  • GPU support for scaling on AWS


  • Deeplearning4J: Neural Net Platform
  • ND4J: Numpy for the JVM
  • DataVec: Tool for Machine Learning ETL Operations
  • JavaCPP: The Bridge Between Java and Native C++
  • Arbiter: Evaluation Tool for Machine Learning Algorithms
  • RL4J: Deep Reinforcement Learning for the JVM


Mycroft claims to be the world’s first open-source assistant.

Mycroft runs anywhere — on a desktop computer, inside an automobile, or on a Raspberry Pi. This is open source software which can be freely remixed, extended, and improved. Mycroft may be used in anything from a science project to an enterprise software application.


OpenCog is a project that aims to build an open-source artificial intelligence framework

OpenCog is a diverse assemblage of cognitive algorithms, each embodying their own innovations — but what makes the overall architecture powerful is its careful adherence to the principle of cognitive synergy. OpenCog was originally based on the release in 2008 of the source code of the proprietary "Novamente Cognition Engine" (NCE) of Novamente LLC. The original NCE code is discussed in the PLN book (ref below). Ongoing development of OpenCog is supported by Artificial General Intelligence Research Institute (AGIRI), the Google Summer of Code project, and others.
  • A graph database that holds terms, atomic formulas, sentences and relationships as hypergraphs; giving them a probabilistic truth-value interpretation, dubbed the AtomSpace.
  • A satisfiability modulo theories solver, built in as a part of a generic graph query engine, for performing graph and hypergraph pattern matching (isomorphic subgraph discovery).
  • An implementation of a probabilistic reasoning engine based on probabilistic logic networks (PLN).
  • A probabilistic genetic program evolver called Meta-Optimizing Semantic Evolutionary Search, or MOSES, originally developed by Moshe Looks who is now employed at Google.
  • An attention allocation system based on economic theory, ECAN.
  • An embodiment system for interaction and learning within virtual worlds based in part on OpenPsi and Unity.
  • A natural language input system consisting of Link Grammar and RelEx, both of which employ AtomSpace-like representations for semantic and syntactic relations.
  • A natural language generation system called SegSim, with implementations NLGen and NLGen2.
  • An implementation of Psi-Theory for handling emotional states, drives, and urges, dubbed OpenPsi.
  • Interfaces to Hanson Robotics robots, including emotion modeling via OpenPsi.

Useful Links

OpenCog Home
OpenCog Wiki


Statistical natural language processing and corpus-based computational linguistics

* Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other
* Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
* Dictionaries
* Lexical/morphological resources
* Courses, Syllabi, and other Educational Resources
* Mailing lists
* Other stuff on the Web: General, IR, IE/Wrappers, People, Societies


Machine Translation systems


* Building a baseline statistical phrase MT system
Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system: NAACL 2006 WMT or 2009 WMT.

Freely downloadable

* Moses
The most-used open-source phrase-based MT decoder. By Philipp Koehn and many others.
* Phrasal
A Java phrase-based MT decoder, largely compatible with the core of Moses, with extra functionality for defining feature-rich ML models. By Daniel Cer, Michel Galley, Spence Green, and others.
* Joshua
A Java hierarchical MT decoder, largely based on the design of Hiero. By Chris Callison-Burch and others.
* Jane
A phrase-based MT decoder by the U. Aachen group.
* cdec
A primarily SCFG-based MT decoder by Chris Dyer and many others. C++.
* EGYPT system
System from 1999 JHU workshop. Mainly of historical interest.
* GIZA++ and mkcls
Franz Och. C++. GPL. Still often used for word alignment.
* Thot
Phrase-based model building kit
* Phramer
An Open-Source Java Statistical Phrase-Based MT Decoder
* Syntax Augmented Machine Translation via Chart Parsing
Andreas Zollmann and Ashish Venugopal

Free, but getting them requires hassle

* Pharaoh decoder
Philipp Koehn, ISI.
Machine Translation Tool Kit. Deng and Byrne.

Part of Speech Taggers

Freely downloadable

* Stanford POS tagger
Loglinear tagger in Java (by Kristina Toutanova)
* hunpos
An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Pre-compiled models. Runs on Linux, Mac OS X, and Windows.
* MBT: Memory-based Tagger
Based on TiMBL
* TreeTagger
A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
* SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
* ACOPOST (formerly ICOPOST)
Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
* MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger
Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
* fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
* mu-TBL
An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available. Prolog.
* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
* QTAG Part of speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
* The TOSCA/LOB tagger.
Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
* The venerable Brill's Transformation-based learning Tagger
A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.
* Original Xerox Tagger
A Common Lisp HMM tagger available by ftp.
* Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)

Free, but require registration

* The ISSCO tagger
HMM tagger. Need to register to download.
* PoSTech Korean morphological analyzer and tagger
Online registration.
* TnT - A Statistical Part-of-Speech Tagger
Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.

Usable by email or on the web, but not distributed freely

* Memory-based tagger
From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
* Birmingham tagger
Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
* CLAWS tagger
The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
* The AMALGAM tagger
The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
* Xerox XRCE MLTT Part Of Speech Taggers
Tags any of 14 languages (European and Arabic), online on the web.
* Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.

Not free

* Lingsoft
Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by email. There is an online demo.
* Conexor
Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
* Xerox
Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
* Infogistics
Infogistics, an Edinburgh spinoff, has a tagging and NP/Verb group chunker available commercially, including an evaluation version.

No longer available

The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.

NP chunking


* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
* Mark Greenwood's Noun Phrase Chunker
A Java reimplementation of Ramshaw and Marcus (1995).
* fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

Generic sequence models


* CRF++
Generic CRF-based model in C++. Open source. By the author of YamCha.
* Carafe
Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.
* FreeLing
A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.


Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.

Semantic Parsers


PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
* Shalmaneser
FrameNet-based by Katrin Erk.
* Tree Kernels in SVMlight by Alessandro Moschitti.
A general package, but it has particularly been used for SRL.

Named Entity Recognition


* Stanford Named Entity Recognizer
A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
* LingPipe
Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Coreference (Anaphora) Resolution


* Stanford Deterministic Coreference Resolution System
Winner of CoNLL 2011 shared task, with subsequent improvements. Distributed as part of Stanford CoreNLP. Heeyoung Lee and others. Java. GPL.
* Reconcile
By Ves Stoyanov and others. Java. GPL.
* Illinois Coreference Package
Java. University of Illinois Research and Academic Use License.
* Berkeley Coreference Resolution
Greg Durrett et al. Mainly Scala. GPL.
* BART
A Beautiful Anaphora Resolution Toolkit. By Yannick Versley and many others. Java. Apache with GPL components.
* Guitar
Java. GPL.

Language modeling toolkits


* IRSTLM Toolkit
Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi et al.
* CMU-Cambridge Statistical Language Modeling toolkit

Downloadable, but requires registration

* The SRI Language Modeling toolkit
by Andreas Stolcke is another good system for building language models, freely available for research purposes.

Not yet classified

* Lextools
A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), which is also available free for non-commercial use.

Friendly concordancing and text analysis tools

* Wordsmith Tools (Mike Scott)
The thing to get if you are working in the Windows world.

Text summarization tools

* A prototype Java Summarisation applet (System Quirk)
A public domain portable multi-document summarization system. (Dragomir Radev and others.)



* Tilburg University's TiMBL
Tilburg's Memory Based Learner by Walter Daelemans et al. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications.
* splitta
Statistical sentence boundary detection by Dan Gillick.
* Time Expression taggers
TIMEX2 standard taggers (site at Mitre).
* NLTK
An open source Python package for NLP application development, with tools such as tokenization, POS tagging, and parsers, by Ed Loper and Steven Bird.
* Ted Pedersen's code
Ngram Statistics Package: Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
* ISIP tools
The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
* Mem. A Perl implementation of Generalized and Improved Iterative Scaling
by Hugo WL ter Doest.
* Automorphology
A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
* Wordnet
Wordnet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an Italian/English/Spanish MultiWordNet and there's now a site for Global Wordnet. (See also Mappings between WordNet versions and Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)
* Penn XTAG project
A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
* Dan Melamed's Assorted Tools
A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis. Some stuff, including a multilingual text editor, is downloadable. MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.
* Naive Bayes algorithm
Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
Text Data Mining API from Lehigh University.
* Emdros: a text database engine for linguistic analysis and research
* Chasen
Japanese morphological analyzer. Descendent of JUMAN.
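
Several of the packages in this list (for example, the Ngram Statistics Package entry) compute word association measures from bigram counts. As a rough illustration of what such measures look like, here is a minimal Python sketch of the Dice coefficient, pointwise mutual information, and Pearson's chi-squared test for a 2x2 table. The function names and the count names (n11, n1p, np1, npp) are chosen here for illustration only; this is not code from any of the listed packages, and their exact definitions may differ.

```python
import math


def contingency(n11, n1p, np1, npp):
    """Expand bigram counts into the four cells of a 2x2 table.

    n11: count of the bigram (w1, w2)
    n1p: count of bigrams with w1 in first position
    np1: count of bigrams with w2 in second position
    npp: total number of bigrams
    """
    n12 = n1p - n11              # w1 followed by something other than w2
    n21 = np1 - n11              # something other than w1 followed by w2
    n22 = npp - n1p - np1 + n11  # neither w1 first nor w2 second
    return n11, n12, n21, n22


def dice(n11, n1p, np1):
    """Dice coefficient: 2 * joint count / sum of the marginal counts."""
    return 2.0 * n11 / (n1p + np1)


def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information in bits; assumes n11 > 0."""
    expected = n1p * np1 / npp   # count expected under independence
    return math.log2(n11 / expected)


def chi_squared(n11, n1p, np1, npp):
    """Pearson's chi-squared statistic for the 2x2 table.

    Assumes no marginal is 0 or equal to npp (otherwise division by zero).
    """
    n11, n12, n21, n22 = contingency(n11, n1p, np1, npp)
    numerator = npp * (n11 * n22 - n12 * n21) ** 2
    denominator = n1p * (npp - n1p) * np1 * (npp - np1)
    return numerator / denominator


if __name__ == "__main__":
    # Made-up counts: the bigram occurs 10 times in 1000 bigrams.
    print(dice(10, 20, 40))
    print(pmi(10, 20, 40, 1000))
    print(chi_squared(10, 20, 40, 1000))
```

When the observed bigram count equals the count expected under independence, both the pointwise mutual information and the chi-squared statistic come out as zero, which is a quick sanity check on any implementation.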

Free, but require registration

* Stuttgart's IMS Corpus Workbench (CWB)
A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
* Gate
University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
* MITRE's Alembic Workbench
A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
* SNoW
SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).


A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein
The PennTools page collects information on a variety of NLP systems, many of which are available externally.


Large collections aimed at the NLP community

* LDC (Linguistic Data Consortium) and its catalogue by year.
Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
* European Language Resources Association and its catalogue.
Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
* ICAME (International Computer Archive of Modern English)
Sells various corpora (including Brown and London-Lund). Information on the corpora is available on the web, by sending the message help by email, or by ftp. Manuals for these corpora are also available.
* Reuters @ NIST
Reuters corpora are now distributed by NIST.
TELRI Research Archive of Computational Tools and Resource. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
* CLR (Consortium for Lexical Research)
Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp. Their catalog is available as a postscript file.
* OTA (Oxford Text Archive)
Provides mainly literary texts. Has a bright new web site. Most materials are available on the web or by anonymous ftp. Some require negotiations with the providers.
* Leipzig Corpora Collection
Sentence collections in MySQL database for 17 mainly European languages.
* BNC (British National Corpus)
A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.
* European Corpus Initiative Multilingual Corpus I (ECI/MCI)
A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. Need to sign a license agreement, available at the WWW site. Also available from the LDC.
* Survey of English Usage
At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
* International Corpus of English (ICE)
Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
* Corpora held by Lancaster University
This link provides its own annotations.
* The European Language Activity Network
Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
* Talkbank.
Rich video and transcripts.

Particular languages


English language corpora available from the sites above are not repeated here.
* Corpora by Geoffrey Sampson's team
The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
* Michigan Corpus of Academic Spoken English (MICASE). 1.7 million words from 1997-2001.
* Penn-Helsinki Parsed Corpus of Middle English
A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
* Corpus of Professional, Spoken American-English (CPSA)
2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).
* Lancaster Parsed Corpus
* Dialogue Diversity Corpus (Bill Mann)
* American National Corpus


* The Lancaster Corpus of Mandarin Chinese (LCMC)
By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.


* JRC-Acquis
A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.
Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.
An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
* World Health Organization Computer Assisted Translation page.
Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
* Searchable Canadian Hansard French-English parallel texts (1986-1993)
From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal
* European Union web server
Parallel text in all EU languages. (In particular try European legislation.)
Parallel and other text in Central and Eastern European languages.


* The Oslo Corpus of Bosnian Texts.


* Parallel Czech-English
Literature translations in Czech and English
* Czech National Corpus project: SYN2000
100 million words of contemporary Czech.


* Association des Bibliophiles Universels
Various French literary works.
* American and French Research on the Treasury of the French Language (ARTFL)
150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).


* COSMAS Corpus
Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
* NEGRA Corpus
Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Available free of charge for academic use.


* Russian National Corpus
150 million words, 5 million words POS-tagged, some in dependency treebank.
* Library of Russian Internet Libraries
Various literary works.


* Slovene-English parallel corpus
1 M words, free to download + on-line concordances.
* Coming soon: Slovene reference corpus of 100 M words


* Croatian National Corpus
100 M words

Spanish and Portuguese

* TychoBrahe Parsed Corpus of Historical Portuguese
Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
* Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese corpora.
It's not clear what their availability is.
* The CUMBRE corpus. Contact Professor Aquilino Sánchez
* The CRATER Spanish corpus
Morphosyntactically tagged telecommunication manuals. Available by ftp.
* Corpus resources for Portuguese
In total about 70 million words, available free, from various sources (newswire, etc.)
* Folha de S. Paulo newspaper
4 annual CDROMs with full text.
Portuguese-English parallel corpus. (In general, various resources at the Linguateca site.)
* See also under ELRA, above.


* Spraakdata, Department of Swedish, Göteborgs University.
Has various searchable part-of-speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.


Name | Language | Size | Availability | Comments
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG)
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material.
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependencies analyses are available (some are free on the web, others via a license agreement) | An under construction Bulgarian HPSG treebank.
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax.
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech, syntactic functions (and, in the future, semantic roles) level in a dependency framework. Text from newspapers and weekly magazines.
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus.
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus.
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures.
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch.
Icelandic Parsed Historical Corpus (IcePaHC) | Icelandic | 1,000,000 words | Free download (LGPL) | Texts from 1150 through 2008!
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts.
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format.
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s.
* Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
* Syntactic Spanish Database (SDB) University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
* CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus. (There's also a 100 sentence Chinese treebank at U. Maryland.)
* LDC Korean Treebank.
* Dublin-Essex Treebank project
Deriving Linguistic Resources from Treebanks.


CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.

Resources for Word Sense Disambiguation

* The Senseval web site
Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
* Ted Pedersen's code
Includes various WSD systems.
* SenseClusters
Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
* Evocation WordNet synset similarity judgments
Judgments on how similar the meanings of synsets are and how common they are in the BNC from Jordan Boyd-Graber.


There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:

Entirely or mainly English

* Alex: A Catalogue of Electronic Texts on the Internet
Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
* Wiretap Electronic Text Archive
Extensive and good quality. Still in the gopher age, though.
* The On-line Books Page
The index here only covers books in English, but there are lots of links to other collections of material in all languages.
* Project Gutenberg
The oldest and largest project to get out-of-copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)
* The Electronic Text Center of the University of Virginia
Large collection of SGML text, mainly in English, but also in other major languages.
* Center for Electronic Texts in the Humanities
Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
* Oxford Electronic Text Library Editions
Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
* Coreference annotated texts
From University of Wolverhampton (R. Mitkov, C. Barbu et al.).

Acquisition data

* CHILDES database.
Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.


* Robin Cover's SGML/XML Web Page
This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).
* Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)
* Xaira
XML Aware Indexing and Retrieval Application. The successor of SARA.
* Microsoft's XML page
* W3C XML page.
* The Corpus Encoding Standard.
An SGML instance designed for language engineering applications. Also the XML version.


Dictionaries of subcategorization frames

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).
Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.
Not exactly a dictionary, but other popular sources are:
* Levin (1993)
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc., not made by the above dictionaries). The index of verbs is online.
* English subcategorization evaluation resources
Gold standard data, from Cambridge University (Anna Korhonen)
See also COMLEX and CELEX available from the LDC.

Dictionaries of assorted languages on the web

* The old version of Robert Beard's Web of Online Dictionaries long ago mutated into I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.


U.S. names with frequency information are available from the Census Bureau.

SGML structured dictionaries

* Cambridge International Dictionary of English and other products in SGML.

Lexical/morphological resources

* English SENSEVAL Resources
Dictionary entries and tagged examples for 35 words.
* ARIES Natural Language Tools
Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.

Courses, Syllabi, and other Educational Resources


* Foundations of Statistical Natural Language Processing
Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze's new textbook, published in June 1999 by MIT Press. Read about courses using this book.
* Corpus-based Linguistics
Christopher Manning's Fall 1994 CMU course syllabus (a postscript file).
* Statistical NLP: Theory and Practice
Christopher Manning's Spring 1996 CMU course materials.
* John Lafferty and Roni Rosenfeld's Spring 1997 CMU course Language and Statistics.
* Boston University (John D. Burger and Lynette Hirschman)
A good course and web site, by the looks!
* Draft of Data-Intensive Linguistics
By Chris Brew and Marc Moens.
* Statistical Natural Language Processing course
By Joakim Nivre. Elsnet supported.
* Short Course: Statistical Methods in NLP
By Philip Resnik
* Linguist's Guide to Statistics by Brigitte Krenn and Christer Samuelsson.
* Statistical and Corpora Based Methods for Processing Natural Languages
By Alon Itai, Technion Computer Science Department. (Don't read those old drafts of mine though ... get the real thing!)
* CS 241 Statistical Models in Natural-Language Processing
Eugene Charniak, Brown University.
* Michael Littman, Duke: 1997, 1998.

"Corpus Linguistics"

* A tutorial on concordances and corpora by Cathy Ball
* Tony Berber Sardinha's Corpus Linguistics course
Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)
* Concordancing and corpus linguistics
Notes prepared by Phil Benson, Hong Kong University.
* Computational Approaches to Collocations
Discussion of all the measures that have been used, and software for calculating them. By Evert and Krenn.

Mailing lists

Mailing lists that have information on these topics include:
* Corpora
The main mailing list for info on corpus-based linguistics. Subscribe by sending the message:
subscribe corpora
to the list server. Or, if you want to subscribe with a different email address, send:
subscribe corpora email-address
(Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!). Or you can subscribe on the web.
* Empiricist
The empiricist list appears to be defunct now. You used to subscribe by sending a "subscribe" message by email.

Other stuff on the Web

General resources

* NIST Human Language Technology programs
Including: TREC, TIDES, ACE, ....
* Text summarization
Tons of resources (tutorials, bibliographies, and software) for document summarization, maintained by Dragomir Radev.
* PropositionBank @ UPenn
* Statistical MT
* Bookmarks for Corpus-based Linguists
An extensive annotated collection by David Lee, aimed at linguistics more than NLP (includes web-searchable corpora and concordancing options).
* HLTCentral
European site aiming to increase transfer of language technologies to the commercial market. News, etc.
* Linguistic annotation
A description of formats for linguistic annotation by Steven Bird.
* CTI Textual Studies, University of Oxford, Guide to Digital Resources
Lists text analysis tools, corpora, and other stuff.
* U. Essex W3-Corpora
Lots of teaching material, links, and online corpora.
* Computational Linguistics and NLP (Kenji Kita, Tokushima U.)
A good, well-organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.
* Survey of the State of the Art in Human Language Technology
* ACL SIGLEX list of Lexical Resources
* Online materials for a course on Learning Dynamical Systems at Brown University.
Lots of neat info.
* Expert Advisory Group for Language Engineering Standards (EAGLES) home page
European standards organization.
* Materials prepared for Michael Barlow's Corpus Linguistics course
* Corpus Linguistics University of Birmingham
* Chris Brew's Teaching Materials for statistical NLP
Not much there last time I looked; you might also try his home page.
* Edinburgh LTG HelpDesk's FAQ
Many of the questions in the FAQ concern issues related to corpora and tagging.
* Content Analysis Resources
Qualitative Text Analysis, Concordances, etc.
* MT paper archive
Lots of papers, etc.

Information Retrieval

* The SMART IR system
* Managing Gigabytes
* TREC conference
* Text-based Intelligent Systems (Bruce Croft)

Information Extraction/Wrapper Induction

* Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.
* IE data sets
Updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.
* Web -> KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.
* RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears to not have been updated since 1999.
* Message Understanding Conference (MUC) information. A US government funded information extraction exercise (from the 1990s).
* Web IR and IE (Einat Amitay). Various links on IR and IE on the web.
* Web question answering system (University of Michigan)
* GATE: General Architecture for Text Engineering (Sheffield)
* Genia Project. Biomedical text information extraction corpus (Tsujii lab). And IE tutorial slides.

People's homepages

Home pages with something useful on them.
* University of Texas at Austin Machine Learning Research Group
* Steven Abney (until 1997)
* Adam Berger
Various stuff on statistical MT and maximum entropy models
* Alex Chengyu Fang
Provides a lot of info on the kinds of things they get up to at UCL, without actually giving you anything to play with yourself.


* International Quantitative Linguistics Association/Journal of Quantitative Linguistics
Not very hip.
* Association for Computational Linguistics/Computational Linguistics