Sunday, September 8, 2019

Giza++ for Bilingual Sentence Alignment

1. Making Giza++

- CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE+CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE
- alignfile = Prefix + “.A3.” + number ;
+ alignfile = Prefix + “.VA3.” + number ;
perl -pi -w -e ‘s/<tr1\//</g;’ GIZA++-v2/* mkcls-v2/*
perl -pi -w -e ‘s/using namespace std::tr1;//g;’ GIZA++-v2/* mkcls-v2/*
perl -pi -w -e ‘s/std::tr1:://g;’ GIZA++-v2/* mkcls-v2/*
sed ‘36d’ mkcls-v2/mystl.h > mkcls-v2/mystl.h.tmp
sed ‘50d’ mkcls-v2/mystl.h.tmp > mkcls-v2/mystl.h
rm mkcls-v2/mystl.h.tmp
make

2. Creating a Bilingual Corpus

/europarl-es-en/
   europarl-v7.es-en.en
   europarl-v7.es-en.es

3 . Preprocessing the Text

./mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < europarl-v7.es-en.en > europarl-v7.es-en.tok.en./mosesdecoder/scripts/tokenizer/tokenizer.perl -l es < europarl-v7.es-en.es > europarl-v7.es-en.tok.es
/europarl-es-en/
   europarl-v7.es-en.en
   europarl-v7.es-en.es
   europarl.v7.es-en.tok.en
   europarl.v7.es-en.tok.es

4. Creating the Input Files

./plain2snt.out europarl.v7.es-en.tok.en europarl.v7.es-en.tok.es
/europarl-es-en/
   europarl-v7.es-en.en
   europarl-v7.es-en.es
   europarl.v7.es-en.tok.en
   europarl.v7.es-en.tok.es
   europarl-v7.es-en.tok.en_europarl-v7.es-en.tok.es.snt
   europarl-v7.es-en.tok.es_europarl-v7.es-en.tok.en.snt
   europarl-v7.es-en.tok.en.vcb
   europarl-v7.es-en.tok.es.vcb
./snt2cooc.out europarl-v7.es-en.tok.en.vcb europarl-v7.es-en.tok.es.vcb europarl-v7.es-en.tok.en_europarl-v7.es-en.tok.es.snt > corp.cooc

5 . Run Giza++

./GIZA++ -S europarl-v7.es-en.tok.en.vcb -T europarl-v7.es-en.tok.es.vcb -C europarl-v7.es-en.tok.es_europarl-v7.es-en.tok.en.snt -CoocurrenceFile corp.cooc -outputpath [path to output directory]
Country pumpkins estaban cantando en el escenario.
NULL ({ 10 13 }) Country ({ 1 2 }) pumpkins ({ 3 }) were ({ 4 }) singing ({ 5 }) on ({ 6 }) the ({ 7 }) stage.
Source: https://medium.com