1. Making Giza++
[1.1] Clone Giza++ from GitHub (https://github.com/moses-smt/giza-pp)
[1.2] Change the Giza++ makefile, found at giza-pp/GIZA++-v2/makefile. Edit the CFLAGS_OPT line as follows:
- CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
+ CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE
[1.3] Amend giza-pp/GIZA++-v2/model3.cpp for a case-insensitive file system. Change line 321 as follows to prevent the .A3.final file from being overwritten by the .a3.final file:
- alignfile = Prefix + ".A3." + number ;
+ alignfile = Prefix + ".VA3." + number ;
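To see why the rename matters, here is a minimal Python sketch of the clash. The function name `case_collisions` and the `out` prefix are hypothetical, and `casefold()` only approximates how a real case-insensitive file system compares names.

```python
from collections import defaultdict

def case_collisions(names):
    """Group file names that differ only by letter case.

    On a case-insensitive file system each group would be one file.
    casefold() is an approximation of the file system's own comparison.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical output names; "out" stands in for the real prefix.
before = ["out.A3.final", "out.a3.final", "out.t3.final"]
after = ["out.VA3.final", "out.a3.final", "out.t3.final"]

print(case_collisions(before))  # [['out.A3.final', 'out.a3.final']]
print(case_collisions(after))   # []
```

With the `.VA3.` name, the two files no longer share a case-folded name, so neither can overwrite the other.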
[1.4] Remove references to tr1. The Mac toolchain does not support tr1, but these modifications (taken from this blog) will make Giza++ compile. Navigate to the giza-pp directory and run:
perl -pi -w -e 's/<tr1\//</g;' GIZA++-v2/* mkcls-v2/*
perl -pi -w -e 's/using namespace std::tr1;//g;' GIZA++-v2/* mkcls-v2/*
perl -pi -w -e 's/std::tr1:://g;' GIZA++-v2/* mkcls-v2/*
sed '36d' mkcls-v2/mystl.h > mkcls-v2/mystl.h.tmp
sed '50d' mkcls-v2/mystl.h.tmp > mkcls-v2/mystl.h
rm mkcls-v2/mystl.h.tmp
make
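If you want to see what the three perl substitutions do before running them over the source tree, the sketch below applies the same patterns to a string in Python. The function name `strip_tr1` and the sample C++ lines are made up for illustration; they are not taken from the Giza++ sources.

```python
import re

def strip_tr1(source: str) -> str:
    """Apply the same three substitutions as the perl one-liners."""
    source = re.sub(r"<tr1/", "<", source)                     # <tr1/header> -> <header>
    source = re.sub(r"using namespace std::tr1;", "", source)  # drop the using-directive
    source = re.sub(r"std::tr1::", "", source)                 # drop the qualifier
    return source

# Illustrative input, not copied from the Giza++ sources.
sample = (
    "#include <tr1/unordered_map>\n"
    "std::tr1::unordered_map<int, int> counts;\n"
)
print(strip_tr1(sample))
```

Note that the third pattern removes the whole `std::tr1::` qualifier rather than rewriting it to `std::`, so the resulting code only compiles where the bare name is already visible in the file.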
2. Creating a Bilingual Corpus
[2.1] To download a high-quality, professionally translated bilingual corpus for most European languages, navigate to http://www.statmt.org/europarl/. I used the Spanish/English corpus, although any will do.
Once you unzip the download, you will have two files, one in L1 (for me, in Spanish, file extension “.es”) and one in L2 (English, file extension “.en”). For this guide, I’m going to write all my output files to a folder called /europarl-es-en/. Right now, we’ve got:
/europarl-es-en/
europarl-v7.es-en.en
europarl-v7.es-en.es
3. Preprocessing the Text
The parallel corpus has to be tokenized and stripped of all HTML tagging to work with Giza++. You can do this using the Moses decoder, part of the Moses SMT package.
[3.1] Download the Moses decoder by cloning it from this GitHub repo.
[3.2] Tokenize the file for each language separately with Moses by running:
./mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < europarl-v7.es-en.en > europarl-v7.es-en.tok.en
./mosesdecoder/scripts/tokenizer/tokenizer.perl -l es < europarl-v7.es-en.es > europarl-v7.es-en.tok.es
When the tokenization process was complete, my /europarl-es-en/ folder contained the following files:
/europarl-es-en/
europarl-v7.es-en.en
europarl-v7.es-en.es
europarl-v7.es-en.tok.en
europarl-v7.es-en.tok.es
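Because the two files are sentence-aligned line by line, a quick sanity check before going further is to confirm that tokenization left both files with the same number of lines. This is a minimal sketch of such a check, not part of Giza++ or Moses; the helper names `count_lines` and `same_length` are my own.

```python
import os

def count_lines(path):
    """Count lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)

def same_length(path_a, path_b):
    """Return True when both files have the same number of lines."""
    return count_lines(path_a) == count_lines(path_b)

if __name__ == "__main__":
    en = "europarl-v7.es-en.tok.en"
    es = "europarl-v7.es-en.tok.es"
    # Only run the check when the tokenized files are actually present.
    if os.path.exists(en) and os.path.exists(es):
        print("line counts match:", same_length(en, es))
```

A mismatch would mean that line i of one file no longer corresponds to line i of the other, which is worth fixing before building the input files.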
4. Creating the Input Files
To run Giza++, we’ll need three types of files: (1) a .vcb file for each of the two languages, containing that language’s vocabulary; (2) a .snt (sentence) file; and (3) a .cooc, or co-occurrence, file.
By convention, you pass Giza++ the source-language file first and the target-language file second.
[4.1] To create the .vcb and .snt files run:
./plain2snt.out europarl-v7.es-en.tok.en europarl-v7.es-en.tok.es
After running, my /europarl-es-en/ folder contained the following files:
/europarl-es-en/
europarl-v7.es-en.en
europarl-v7.es-en.es
europarl-v7.es-en.tok.en
europarl-v7.es-en.tok.es
europarl-v7.es-en.tok.en_europarl-v7.es-en.tok.es.snt
europarl-v7.es-en.tok.es_europarl-v7.es-en.tok.en.snt
europarl-v7.es-en.tok.en.vcb
europarl-v7.es-en.tok.es.vcb
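It can help to peek inside these files. The sketch below assumes the layout as I understand it: each .vcb line is `id word frequency`, and the .snt file stores each sentence pair as three lines (a count, the first language's word ids, then the second language's word ids). Treat that layout as an assumption and check it against your own files; the function names and the toy data are hypothetical.

```python
def parse_vcb(lines):
    """Map word id -> (word, frequency), assuming 'id word frequency' lines."""
    vocab = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that does not match the assumed layout
        word_id, word, freq = parts
        vocab[int(word_id)] = (word, int(freq))
    return vocab

def decode_snt(lines, first_vocab, second_vocab):
    """Turn .snt lines (assumed groups of three) back into words."""
    rows = [line.split() for line in lines]
    decoded = []
    for i in range(0, len(rows) - 2, 3):
        count, first_ids, second_ids = rows[i], rows[i + 1], rows[i + 2]
        decoded.append((
            int(count[0]),
            [first_vocab[int(x)][0] for x in first_ids],
            [second_vocab[int(x)][0] for x in second_ids],
        ))
    return decoded

# Toy data in the assumed layout, not real Europarl output.
en_vocab = parse_vcb(["2 the 1", "3 stage 1"])
es_vocab = parse_vcb(["2 el 1", "3 escenario 1"])
snt_lines = ["1", "2 3", "2 3"]

print(decode_snt(snt_lines, en_vocab, es_vocab))
# [(1, ['the', 'stage'], ['el', 'escenario'])]
```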
[4.2] Create the .cooc file by running the following command. You can use either of the two .snt files, regardless of which is your source and target language.
./snt2cooc.out europarl-v7.es-en.tok.en.vcb europarl-v7.es-en.tok.es.vcb europarl-v7.es-en.tok.en_europarl-v7.es-en.tok.es.snt > corp.cooc
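Conceptually, a co-occurrence file records which word pairs ever appear together in a sentence pair. The Python sketch below shows that idea on toy word ids; it is not a reimplementation of snt2cooc.out, and it does not try to reproduce the tool's exact output (for example, how it treats the NULL word). The function name `cooccurrences` is hypothetical.

```python
def cooccurrences(sentence_pairs):
    """Collect every (first id, second id) pair seen in the same sentence pair."""
    pairs = set()
    for first_ids, second_ids in sentence_pairs:
        for a in first_ids:
            for b in second_ids:
                pairs.add((a, b))
    return sorted(pairs)

# Toy sentence pairs as lists of word ids, not real corpus data.
toy_corpus = [
    ([2, 3], [2, 3]),
    ([2, 4], [4]),
]

for a, b in cooccurrences(toy_corpus):
    print(a, b)
```

Pairs that never co-occur, such as id 3 on the first side with id 4 on the second, are simply absent from the result.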
5. Run Giza++
As GIZA++ produces a lot of output files, I recommend creating a new output directory. When you’re ready, run:
./GIZA++ -S europarl-v7.es-en.tok.en.vcb -T europarl-v7.es-en.tok.es.vcb -C europarl-v7.es-en.tok.es_europarl-v7.es-en.tok.en.snt -CoocurrenceFile corp.cooc -outputpath [path to output directory]
Perhaps the most interesting output file is the aforementioned .VA3.final, which contains sentence pairs and word alignments from the source language into the target language. For example, here’s some of my output from a children’s book about anthropomorphic pumpkins:
Country pumpkins estaban cantando en el escenario.
NULL ({ 10 13 }) Country ({ 1 2 }) pumpkins ({ 3 }) were ({ 4 }) singing ({ 5 }) on ({ 6 }) the ({ 7 }) stage.
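To work with these alignments programmatically, you can pull each word and its list of target positions out of the alignment line. This is a minimal sketch that assumes every aligned word looks like `word ({ i j })`, as in the sample above; the function name `parse_alignment` is my own, and tokens without a `({ })` group (like the final word in the quoted sample) are skipped.

```python
import re

# One source word followed by its ({ ... }) set of target positions.
ALIGNED_WORD = re.compile(r"(\S+)\s+\(\{([\d\s]*)\}\)")

def parse_alignment(line):
    """Return a list of (word, [target positions]) from one alignment line."""
    return [
        (word, [int(i) for i in positions.split()])
        for word, positions in ALIGNED_WORD.findall(line)
    ]

line = ("NULL ({ 10 13 }) Country ({ 1 2 }) pumpkins ({ 3 }) were ({ 4 }) "
        "singing ({ 5 }) on ({ 6 }) the ({ 7 }) stage.")

for word, positions in parse_alignment(line):
    print(word, positions)
# NULL [10, 13]
# Country [1, 2]
# pumpkins [3]
# were [4]
# singing [5]
# on [6]
# the [7]
```

The positions are 1-based indexes into the sentence on the preceding line, and the NULL entry collects the words on that line that were aligned to nothing.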
Source: https://medium.com