Sunday, September 8, 2019

Giza++ for Bilingual Sentence Alignment

1. Making Giza++

[1.1] Clone Giza++ from GitHub (https://github.com/moses-smt/giza-pp)

[1.2] Change the Giza++ makefile, found at giza-pp/GIZA++-v2/makefile. Edit the CFLAGS_OPT line as follows:

- CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
+ CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE

[1.3] Amend giza-pp/GIZA++-v2/model3.cpp for a case-insensitive file system. On such a file system the names .A3.final and .a3.final refer to the same file, so change line 321 as follows to prevent the .A3.final file from being overwritten by the .a3.final file:

- alignfile = Prefix + ".A3." + number ;
+ alignfile = Prefix + ".VA3." + number ;

[1.4] Remove references to tr1. The Mac toolchain does not support tr1, but these modifications (taken from this blog) let the code compile. Navigate to the giza-pp directory and run:

perl -pi -w -e 's/<tr1\//</g;' GIZA++-v2/* mkcls-v2/*
perl -pi -w -e 's/using namespace std::tr1;//g;' GIZA++-v2/* mkcls-v2/*
perl -pi -w -e 's/std::tr1:://g;' GIZA++-v2/* mkcls-v2/*
sed '36d' mkcls-v2/mystl.h > mkcls-v2/mystl.h.tmp
sed '50d' mkcls-v2/mystl.h.tmp > mkcls-v2/mystl.h
rm mkcls-v2/mystl.h.tmp
make

2. Creating a Bilingual Corpus

[2.1] To download a high-quality, professionally translated bilingual corpus for most European languages, navigate to http://www.statmt.org/europarl/. I used the Spanish/English corpus, although any language pair will do.

Once you unzip, you will have two files, one in L1 (for me, in Spanish, file extension “.es”) and one in L2 (English, file extension “.en”). For this guide, I’m going to write all my output files to a folder called /europarl-es-en/. Right now, we’ve got:

/europarl-es-en/
   europarl-v7.es-en.en
   europarl-v7.es-en.es
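
Before going further, it can be worth confirming that the two files really are parallel, meaning line N of one is the translation of line N of the other. Here is a minimal sketch of such a check. The helper names count_lines and is_line_aligned are my own for illustration and are not part of Giza++ or Moses.

```python
def count_lines(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def is_line_aligned(source_path, target_path):
    """Return True when both sides of the corpus have the same number of lines."""
    return count_lines(source_path) == count_lines(target_path)
```

For the files above you would call is_line_aligned("europarl-v7.es-en.en", "europarl-v7.es-en.es"). Equal line counts do not prove the translations match, but a mismatch is a sure sign that something went wrong during download or unzipping.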

3. Preprocessing the Text

The parallel corpus will have to be tokenized and stripped of all HTML tagging to work with Giza++. You can do this using the moses decoder, part of the moses SMT package.

[3.1] Download the moses decoder by cloning it from this github repo.

[3.2] Tokenize the file for each language separately with the Moses tokenizer by running:

./mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < europarl-v7.es-en.en > europarl-v7.es-en.tok.en
./mosesdecoder/scripts/tokenizer/tokenizer.perl -l es < europarl-v7.es-en.es > europarl-v7.es-en.tok.es

When the tokenization process was complete, my /europarl-es-en/ folder contained the following files:

/europarl-es-en/
   europarl-v7.es-en.en
   europarl-v7.es-en.es
   europarl-v7.es-en.tok.en
   europarl-v7.es-en.tok.es
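
If any markup lines or empty lines survive tokenization, they have to be dropped from both files at once, otherwise the two sides fall out of step. The sketch below shows one way to filter the pairs together. The function name clean_pairs, the "starts with <" test for markup, and the 100-token cap are all my own illustrative assumptions, not something Giza++ or Moses prescribes.

```python
def clean_pairs(source_lines, target_lines, max_tokens=100):
    """Yield (source, target) pairs that are non-empty, markup-free, and not too long."""
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty, so there is nothing to align
        if src.startswith("<") or tgt.startswith("<"):
            continue  # assumed to be a leftover markup line
        if len(src.split()) > max_tokens or len(tgt.split()) > max_tokens:
            continue  # hypothetical length cap, chosen only for illustration
        yield src, tgt
```

Because pairs are kept or dropped as a unit, writing the surviving sources to one file and the surviving targets to another keeps the corpus parallel.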

4. Creating the Input Files

In order to run Giza, we’ll need three types of files: (1) a .vcb file containing a list of vocabulary words for each of the two languages, (2) a .snt (sentence) file, and (3) a .cooc, or co-occurrence, file.

By convention with Giza, you pass in files in the source language first, and the target language second.

[4.1] To create the .vcb and .snt files run:

./plain2snt.out europarl-v7.es-en.tok.en europarl-v7.es-en.tok.es

After running, my /europarl-es-en/ folder contained the following files:

/europarl-es-en/
   europarl-v7.es-en.en
   europarl-v7.es-en.es
   europarl-v7.es-en.tok.en
   europarl-v7.es-en.tok.es
   europarl-v7.es-en.tok.en_europarl-v7.es-en.tok.es.snt
   europarl-v7.es-en.tok.es_europarl-v7.es-en.tok.en.snt
   europarl-v7.es-en.tok.en.vcb
   europarl-v7.es-en.tok.es.vcb
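
To get a feel for what these files hold: as I understand the formats, each .vcb line is "id word count", and each sentence pair in a .snt file takes three lines (how often the pair occurs, the first sentence as ids, the second sentence as ids). The following sketch reads both formats from strings. The function names and the toy data are invented for illustration, so compare against your own files before relying on it.

```python
def parse_vcb(text):
    """Map token id -> word from .vcb text (one 'id word count' entry per line)."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        token_id, word, _count = line.split()
        vocab[int(token_id)] = word
    return vocab


def parse_snt(text):
    """Yield (count, first_ids, second_ids) from .snt text, three lines per pair."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 2, 3):
        count = int(lines[i])
        first_ids = [int(tok) for tok in lines[i + 1].split()]
        second_ids = [int(tok) for tok in lines[i + 2].split()]
        yield count, first_ids, second_ids


# Toy data, made up for this example.
EN_VCB = "2 the 1\n3 pumpkins 1\n4 sing 1\n"
SNT = "1\n2 3 4\n2 3 4\n"

en_vocab = parse_vcb(EN_VCB)
for count, first_ids, second_ids in parse_snt(SNT):
    print(count, [en_vocab[i] for i in first_ids])
```

Decoding a few sentence pairs back into words like this is a quick way to confirm that the .snt file lines up with the .vcb files you expect.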

[4.2] Create the .cooc file by running the following command. You can use either of the two .snt files, regardless of which is your source and target language.

./snt2cooc.out europarl-v7.es-en.tok.en.vcb europarl-v7.es-en.tok.es.vcb europarl-v7.es-en.tok.en_europarl-v7.es-en.tok.es.snt > corp.cooc
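
Conceptually, a co-occurrence file records which source word ids ever appear in the same sentence pair as which target word ids. The sketch below computes that set from already-parsed sentence pairs. It only illustrates the idea: the function name is mine, and it does not reproduce the exact file layout that snt2cooc.out writes or any special handling of the NULL token.

```python
def cooccurring_pairs(sentence_pairs):
    """Collect every (source_id, target_id) that appears together in some sentence pair."""
    pairs = set()
    for source_ids, target_ids in sentence_pairs:
        for source_id in set(source_ids):
            for target_id in set(target_ids):
                pairs.add((source_id, target_id))
    return sorted(pairs)
```

For example, cooccurring_pairs([([2, 3], [5]), ([3], [5, 6])]) gives the pairs (2, 5), (3, 5), and (3, 6).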

5. Run Giza++

As GIZA++ produces a lot of output files, I recommend creating a new output directory for them. When you’re ready, run:

./GIZA++ -S europarl-v7.es-en.tok.en.vcb -T europarl-v7.es-en.tok.es.vcb -C europarl-v7.es-en.tok.es_europarl-v7.es-en.tok.en.snt -CoocurrenceFile corp.cooc -outputpath [path to output directory]
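
If you plan to rerun this for several corpora, it may help to script the call. Here is a small sketch that creates the output directory and assembles the same command line as above. The function names are hypothetical, and the only flags used are the ones already shown in the command.

```python
import os
import subprocess


def giza_command(source_vcb, target_vcb, snt, cooc, output_dir, binary="./GIZA++"):
    """Build the argument list for the GIZA++ call shown above."""
    return [
        binary,
        "-S", source_vcb,
        "-T", target_vcb,
        "-C", snt,
        "-CoocurrenceFile", cooc,
        "-outputpath", output_dir,
    ]


def run_giza(source_vcb, target_vcb, snt, cooc, output_dir):
    """Create the output directory, then run GIZA++ and raise if it fails."""
    os.makedirs(output_dir, exist_ok=True)
    command = giza_command(source_vcb, target_vcb, snt, cooc, output_dir)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the long file names.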

Perhaps the most interesting output file is the aforementioned .VA3.final, which contains sentence pairs and word alignments from the source language into the target language. For example, here’s some of my output from a children’s book about anthropomorphic pumpkins:

Country pumpkins estaban cantando en el escenario.
NULL ({ 10 13 }) Country ({ 1 2 }) pumpkins ({ 3 }) were ({ 4 }) singing ({ 5 }) on ({ 6 }) the ({ 7 }) stage.
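
To work with these alignments programmatically, the "word ({ positions })" line can be pulled apart with a regular expression. This is a minimal sketch based only on the shape of the line shown above. The names ALIGNED_TOKEN and parse_alignment_line are my own, and tokens that have no ({ ... }) group after them are simply skipped.

```python
import re

# One aligned token looks like: word ({ 1 2 })
ALIGNED_TOKEN = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_alignment_line(line):
    """Return a list of (word, [positions]) tuples from one alignment line."""
    return [
        (word, [int(position) for position in positions.split()])
        for word, positions in ALIGNED_TOKEN.findall(line)
    ]


sample = "NULL ({ 10 13 }) Country ({ 1 2 }) pumpkins ({ 3 })"
print(parse_alignment_line(sample))
# [('NULL', [10, 13]), ('Country', [1, 2]), ('pumpkins', [3])]
```

Each number is the position of a word in the other sentence of the pair, so the result can be joined against the first line of the pair to recover word-to-word links.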
Source: https://medium.com
