Wednesday, December 27, 2017

Open source translation tools to localize your project

Localization plays a central role in the ability to customize an open source project to suit the needs of users around the world. Besides coding, language translation is one of the main ways people around the world contribute to and engage with open source projects.
There are tools specific to the language services industry (surprised to hear that's a thing?) that enable a smooth localization process with a high level of quality. Categories that localization tools fall into include:
  • Computer-assisted translation (CAT) tools
  • Machine translation (MT) engines
  • Translation management systems (TMS)
  • Terminology management tools
  • Localization automation tools
The proprietary versions of these tools can be quite expensive. A single license for SDL Trados Studio (the leading CAT tool) can cost thousands of euros, and even then it is only useful for one individual and the customizations are limited (and psst, they cost more, too). Open source projects looking to localize into many languages and streamline their localization processes will want to look at open source tools to save money and get the flexibility they need with customization. I've compiled this high-level survey of many of the open source localization tool projects out there to help you decide what to use.

Computer-assisted translation (CAT) tools

OmegaT CAT tool. Here you see the translation memory (Fuzzy Matches) and terminology recall (Glossary) features at work. OmegaT is licensed under the GNU General Public License version 3 or later.
CAT tools are a staple of the language services industry. As the name implies, CAT tools help translators perform the tasks of translation, bilingual review, and monolingual review as quickly as possible and with the highest possible consistency through reuse of translated content (also known as translation memory). Translation memory and terminology recall are two central features of CAT tools. They enable a translator to reuse previously translated content from old projects in new projects. This allows them to translate a high volume of words in a shorter amount of time while maintaining a high level of quality through terminology and style consistency. This is especially handy for localization, as text in a lot of software and web UIs is often the same across platforms and applications. CAT tools are standalone pieces of software, though, so translators who use them must work locally and merge their work into a central repository.
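The fuzzy-match lookup behind translation memory can be sketched in a few lines. This is a minimal illustration and not how any particular CAT tool scores matches: the memory contents, the `fuzzy_matches` name, and the 0.8 threshold are assumptions made for the example, and the similarity measure is simply Python's standard `difflib` ratio.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored translation.
translation_memory = {
    "Save the file": "Datei speichern",
    "Open the file": "Datei öffnen",
    "Close the window": "Fenster schließen",
}


def fuzzy_matches(segment, memory, threshold=0.8):
    """Return (score, source, target) tuples at or above threshold, best first."""
    hits = []
    for source, target in memory.items():
        # Character-level similarity between the new segment and a stored one.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold:
            hits.append((score, source, target))
    return sorted(hits, reverse=True)


# A new segment that differs slightly from one already translated.
for score, source, target in fuzzy_matches("Save the files", translation_memory):
    print(f"{score:.0%}  {source!r} -> {target!r}")
```

A translator would see the stored translation of the closest segment as a suggestion and only adjust the part that changed.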
Tools to check out:

Machine translation (MT) engines

MT engines automate the transfer of text from one language to another. MT is broken up into three primary methodologies: rule-based, statistical, and neural (which is the new player). The most widespread MT methodology is statistical, which (in very brief terms) draws conclusions about the interconnectedness of a pair of languages by running statistical analyses over annotated bilingual corpus data using n-gram models. When a new source language phrase is introduced to the engine for translation, it looks within its analyzed corpus data to find statistically relevant equivalents, which it produces in the target language. MT can be useful as a productivity aid to translators, changing their primary task from translating a source text into a target text to post-editing the MT engine's target language output. I don't recommend using raw MT output in localizations, but if your community is trained in the art of post-editing, MT can be a useful tool to help them make large volumes of contributions.
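To make the statistical idea concrete, here is a deliberately tiny sketch that picks the most frequent target phrase seen for a source phrase. It is an assumption-laden toy: real statistical engines also use n-gram language models, word alignment, and a decoder, none of which appear here, and the corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical phrase-aligned corpus: (source phrase, target phrase) pairs.
aligned_phrases = [
    ("good morning", "buenos días"),
    ("good morning", "buenos días"),
    ("good morning", "buen día"),
    ("thank you", "gracias"),
]


def train(pairs):
    """Count how often each target phrase is aligned with each source phrase."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table


def translate(phrase, table):
    """Return (most frequent target, relative frequency), or None if unseen."""
    candidates = table.get(phrase)
    if not candidates:
        return None
    target, count = candidates.most_common(1)[0]
    return target, count / sum(candidates.values())


phrase_table = train(aligned_phrases)
print(translate("good morning", phrase_table))
```

The relative frequency returned alongside the translation is the kind of score a post-editor could use to decide which suggestions deserve a closer look.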
Tools to check out:

Translation management systems (TMS)

Mozilla's Pontoon translation management system user interface. With WYSIWYG editing, you can translate content in context and simultaneously perform translation and quality assurance. Pontoon is licensed under the BSD 3-clause New or Revised License.
TMS tools are web-based platforms that allow you to manage a localization project and enable translators and reviewers to do what they do best. Most TMS tools aim to automate many manual parts of the localization process by including version control system (VCS) integrations, cloud services integrations, project reporting, as well as the standard translation memory and terminology recall features. These tools are most amenable to community localization or translation projects, as they allow large groups of translators and reviewers to contribute to a project. Some also use a WYSIWYG editor to give translators context for their translations. This added context improves translation accuracy and cuts down on the amount of time a translator has to wait between doing the translation and reviewing the translation within the user interface.
Tools to check out

Terminology management tools

Brigham Young University's BaseTerm tool displays the new-term entry dialogue window. BaseTerm is licensed under the Eclipse Public License.
Terminology management tools give you a GUI to create terminology resources (known as termbases) to add context and ensure translation consistency. These resources are consumed by CAT tools and TMS platforms to aid translators in the process of translation. For languages in which a term could be either a noun or a verb depending on the context, terminology management tools allow you to add metadata for a term that labels its gender and part of speech and supplies a monolingual definition and context clues. Terminology management is often an underserved, but no less important, part of the localization process. In both the open source and proprietary ecosystems, there are only a handful of options available.
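As a rough picture of what such term metadata looks like, the sketch below models a termbase as a list of records and uses part of speech to pick the right equivalent. The field names, the sample entries, and the `equivalents` helper are assumptions for illustration and do not reflect the schema of any specific terminology tool.

```python
from dataclasses import dataclass


@dataclass
class Term:
    concept_id: int          # terms sharing a concept are equivalents
    language: str
    text: str
    part_of_speech: str
    gender: str = ""
    definition: str = ""
    context: str = ""


# Hypothetical termbase: "record" is a noun in one concept and a verb in another.
termbase = [
    Term(1, "en", "record", "noun",
         definition="A stored set of related data.", context="Delete the record."),
    Term(1, "de", "Datensatz", "noun", gender="masculine"),
    Term(2, "en", "record", "verb",
         definition="To capture audio or video.", context="Record the meeting."),
    Term(2, "de", "aufzeichnen", "verb"),
]


def equivalents(terms, text, part_of_speech, target_language):
    """Find target-language terms for the concept matching text and part of speech."""
    concept_ids = {
        t.concept_id for t in terms
        if t.text == text and t.part_of_speech == part_of_speech
    }
    return [
        t for t in terms
        if t.concept_id in concept_ids and t.language == target_language
    ]


print([t.text for t in equivalents(termbase, "record", "noun", "de")])
print([t.text for t in equivalents(termbase, "record", "verb", "de")])
```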
Tools to check out

Localization automation tools

The Ratel and Rainbow components of the Okapi Framework. Photo courtesy of the Okapi Framework. The Okapi Framework is licensed under the Apache License version 2.0.
Localization automation tools facilitate the way you process localization data. This can include text extraction, file format conversion, tokenization, VCS synchronization, term extraction, pre-translation, and various quality checks over common localization standard file formats. In some tool suites, like the Okapi Framework, you can create automation pipelines for performing various localization tasks. These pipelines are useful in a variety of situations, but their main utility is in the time they save by automating many tasks. They can also move you closer to a more continuous localization process.
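The pipeline idea can be illustrated with three small steps chained together: text extraction, pre-translation from exact matches, and a placeholder quality check. This is a generic sketch and not the Okapi Framework's API; the `key=value` input format, the function names, and the sample data are all assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\{\w+\}")


def extract(lines):
    """Extract translatable units from 'key=value' lines, skipping comments."""
    units = {}
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            units[key.strip()] = value.strip()
    return units


def pretranslate(units, memory):
    """Fill in exact matches from memory; unmatched units stay None."""
    return {key: memory.get(text) for key, text in units.items()}


def check_placeholders(units, translations):
    """Return keys whose translation lost or gained {placeholders}."""
    problems = []
    for key, source in units.items():
        target = translations.get(key)
        if target is None:
            continue  # nothing to check until a translation exists
        if sorted(PLACEHOLDER.findall(source)) != sorted(PLACEHOLDER.findall(target)):
            problems.append(key)
    return problems


# Hypothetical input file content and translation memory.
source_lines = ["# UI strings", "greeting=Hello, {name}!", "save=Save", "quit=Quit"]
memory = {"Hello, {name}!": "Hallo!", "Save": "Speichern"}

units = extract(source_lines)
translations = pretranslate(units, memory)
problems = check_placeholders(units, translations)
print(translations)
print("Placeholder problems:", problems)
```

Each step takes the previous step's output, which is what makes it easy to rerun the whole chain whenever the source strings change.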
Tools to check out
Source: https://opensource.com

Change default code page of Windows console to UTF-8

Running chcp 65001 in the command prompt before using any tools helps, but is there any way to set it as the default code page?

Changing the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP value to 65001 appears to make the system unable to boot in my case.
A proposed alternative is to set HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun to @chcp 65001>nul

Batch file .bat

@ECHO OFF
REM change CHCP to UTF-8
CHCP 65001
CLS
 
Save this in C:\Windows\System32 as switch.bat, then create a shortcut to cmd.exe on the desktop.
In the shortcut's properties, change the target to: C:\Windows\System32\cmd.exe /k switch


Note that it will print Active code page: 65001 to stdout. So if you are doing something like CHCP 65001 && mycommand.exe, you'll get the code page printed out at the start. To suppress that, use CHCP 65001 >nul && mycommand.exe instead.

Reg file:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe]
"CodePage"=dword:fde9
  1. Value must be in hex
  2. Top line must be included exactly as is
  3. HKEY_CURRENT_USER cannot be abbreviated
  4. dword cannot be omitted
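Because the .reg file wants the value in hex, it helps to see where fde9 comes from. The short sketch below converts the decimal code page number and rebuilds the registry line; the variable names are only illustrative.

```python
code_page = 65001  # the UTF-8 code page number passed to chcp

# .reg files take dword values as hexadecimal digits without a 0x prefix.
hex_value = format(code_page, "x")
reg_line = f'"CodePage"=dword:{hex_value}'

print(reg_line)  # "CodePage"=dword:fde9

# Sanity check: converting the hex digits back gives the decimal value.
assert int(hex_value, 16) == code_page
```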

Command Prompt:
REG ADD HKCU\Console\%SystemRoot^%_system32_cmd.exe /v CodePage /t REG_DWORD /d 65001
  1. Value can be in dec or hex
  2. %SystemRoot% must be escaped
  3. REG_DWORD cannot be omitted

PowerShell:
New-Item -ErrorAction Ignore HKCU:\Console\%SystemRoot%_system32_cmd.exe
Set-ItemProperty HKCU:\Console\%SystemRoot%_system32_cmd.exe CodePage 65001
  1. Value can be in dec or hex
  2. -Type DWord is assumed with PowerShell 3+
  3. Can use ni -> New-Item
  4. Can use sp -> Set-ItemProperty
  5. Can use -ea 0 -> -ErrorAction Ignore

Cygwin:
regtool add '\HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe'
regtool set '\HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe\CodePage' 65001
  1. Value can be in dec or hex
  2. Can use / -> \
  3. Can use HKCU -> HKEY_CURRENT_USER
  4. Can use user -> HKEY_CURRENT_USER

Thursday, December 7, 2017

MemoQ and VBS

A simple snippet of script code for exporting a TM to TMX has been circulating for a while now as a VBA macro to run from Microsoft Word, for example. Personally, I object to running something in MS Word that has nothing to do with that program, so I recoded it as an executable script and added a few extra tweaks:

tmFolder = InputBox("Which TM should be exported?")
if tmFolder <> "" then

' The path where all my memoQ TMs are stored
standardTMpath = "C:\ProgramData\MemoQ\Translation Memories\"

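'build paths; the date stamp is assembled as YYYY-MM-DD because Date() alone is
'formatted per locale and may contain "/" characters, which are invalid in file names
dateStamp = Year(Date()) & "-" & Right("0" & Month(Date()), 2) & "-" & Right("0" & Day(Date()), 2)
outputTMXfile = ".\" & tmFolder & "_" & dateStamp & ".tmx"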
tmFolder = standardTMpath & tmFolder

Set fact = CreateObject("MemoQ.ClientService.ServiceFactoryScripting")
Set tmService = fact.CreateTMService
Set createTMRes = tmService.ExportToTMX(tmFolder, outputTMXfile)
if createTMRes.Success = False then
   MsgBox createTMRes.ShortErrorMessage
else
   MsgBox "The TM was exported."
end if

end if
Just copy that script into a text file, rename the extension to *.vbs and you have a double-clickable script to export a TM without opening memoQ. The TMX export is placed in the same folder where the script is executed and tagged with the date of the export.

Encouraged by this little test, I went on to tackle one of my pet peeves: the lack of multi-file import capabilities in memoQ TMs. Trados Studio has no problem importing a folder full of TMX files to a TM in one go, but with memoQ one must import each TMX file - painfully - one at a time. The pain is felt quite severely if, for example, you are a former OmegaT user with a legacy of 300+ TMX files from your old projects.

So I wrote another little script which allows me to drag and drop any number of TMX files onto its icon and have them all imported into the specified TM. This is a rather crude example for just one set of hard-coded sublanguages (DE-DE and EN-US). The API currently does not allow sublanguages to be ignored for the import. Adapt this to use your relevant sublanguages if you like:
'
' memoQ TMX import macro
' drag & drop TMX files onto the script icon
'
tmFolder = InputBox("To which TM should the TMX file(s) be imported?")
If tmFolder <> "" Then

' The path where all my memoQ TMs are stored
standardTMpath = "E:\Working databases\MemoQ\TMs\"

'build absolute path
tmFolder = standardTMpath & tmFolder

' Create the ServiceFactoryScripting object and TM service
Set objSFS = CreateObject("MemoQ.ClientService.ServiceFactoryScripting")
Set svcTM = objSFS.CreateTMService

' Set import options parameters
Set objImportOptions = CreateObject("MemoQ.ClientService.TMImportOptionsScripting")
objImportOptions.TMXSourceLanguageCode = "ger-de" 
objImportOptions.TMXTargetLanguageCode = "eng-us"  
objImportOptions.TradosImportOptimization = False
objImportOptions.DefaultValues = Null
objImportOptions.DefaultsOverrideInput = False

  Set objArgs = WScript.Arguments
  For I = 0 To objArgs.Count - 1
    tmxfile = objArgs(I)
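    ' Date() alone is formatted per locale and may contain "/", which is invalid in file names
    dateStamp = Year(Date()) & "-" & Right("0" & Month(Date()), 2) & "-" & Right("0" & Day(Date()), 2)
    logFileName = tmFolder & "_" & dateStamp & "_" & "importlog." & I & ".txt"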

    Set returnvalue = svcTM.ImportFromTMX(tmFolder, tmxfile, objImportOptions, logFileName)
    If returnvalue.Success = False Then
        MsgBox returnvalue.ShortErrorMessage
    Else
        MsgBox "No errors in the import of " & tmxfile & ". See the log file at: " & logFileName
    End If
  Next

end if

Source: http://www.translationtribulations.com

Tuesday, December 5, 2017

Google Translate with Internet Explorer COM Object in AutoHotKey


Google Translate Query Parameters

  • HTTP request

    POST https://translation.googleapis.com/language/translate/v2

    • sl - source language code (auto for autodetection)
    • tl - translation language
    • q - source text / word
    • ie - input encoding (a guess)
    • oe - output encoding (a guess)
    • dt - may be included more than once and specifies what to return in the reply.
Here are some values for dt. If the value is set, the following data will be returned:
  • t - translation of source text
  • at - alternate translations
  • rm - transcription / transliteration of source and translated texts
  • bd - dictionary, in case source text is one word (you get translations with articles, reverse translations, etc.)
  • md - definitions of source text, if it's one word
  • ss - synonyms of source text, if it's one word
  • ex - examples
  • rw - See also list.
  • client t probably represents the standalone google translate web app (as opposed to a mobile app, or the widget that pops up if you google search "translate")
  • sl is source language
  • tl is translate language (the language you want to translate into)
  • srcrom seems to be present when the source text has no spelling suggestions
sl:auto
tl:sk
hl:sk //language of the interface (default:en, you can try xx-bork or xx-hacker)
dt:ld - ?
dt:qc - ?
dt:rm - ?
dt:ss - ?
dt:sw - ?
ie:UTF-8 // encoding of the input (default: utf-8)
oe:UTF-8 // encoding of the output, the results (default: utf-8)
otf:1 - ?
srcrom:1 - ?
ssel:3 - ?
tsel:0 - ?
https://translate.google.com/translate_a/single?client=t&sl=en&tl=it&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&kc=1&tk=685310.807246&q=c
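A query string like the one above, with dt repeated once per requested part, can be assembled as in the following sketch. It only builds the URL and sends nothing. The `build_query` helper and its defaults are assumptions for illustration, and it makes no attempt to generate the tk value seen in the sample URL, so a request built this way may not be accepted by the endpoint.

```python
from urllib.parse import urlencode

BASE_URL = "https://translate.google.com/translate_a/single"


def build_query(text, source="auto", target="en", parts=("t",)):
    """Build the URL with one dt parameter per requested result part."""
    params = [
        ("client", "t"),
        ("sl", source),
        ("tl", target),
        ("ie", "UTF-8"),
        ("oe", "UTF-8"),
    ]
    # dt may appear more than once, so use a list of pairs rather than a dict.
    params += [("dt", part) for part in parts]
    params.append(("q", text))
    return BASE_URL + "?" + urlencode(params)


url = build_query("c", source="en", target="it", parts=("t", "at", "bd"))
print(url)
```

Passing a list of pairs to `urlencode` is what keeps the repeated dt keys; a plain dictionary would silently keep only one of them.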
  • q (string, required) - The input text to translate. Repeat this parameter to perform translation operations on multiple text inputs.
  • target (string, required) - The language to use for translation of the input text, set to one of the language codes listed in Language Support.
  • format (string) - The format of the source text, in either HTML (default) or plain-text. A value of html indicates HTML and a value of text indicates plain-text.
  • source (string) - The language of the source text, set to one of the language codes listed in Language Support. If the source language is not specified, the API will attempt to detect the source language automatically and return it within the response.
  • model (string) - The translation model. Can be either base to use the Phrase-Based Machine Translation (PBMT) model, or nmt to use the Neural Machine Translation (NMT) model. If omitted, then nmt is used. If the model is nmt, and the requested language translation pair is not supported for the NMT model, then the request is translated using the base model.
  • key (string) - A valid API key to handle requests for this API. If you are using OAuth 2.0 service account credentials (recommended), do not supply this parameter.
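The parameters listed above can be validated and encoded before a request is made, as in this sketch. It only prepares the encoded parameters for the v2 endpoint and performs no network call; the `build_params` helper, its argument names, and its error messages are assumptions, and how the encoded string is sent is left to whichever HTTP client you use.

```python
from urllib.parse import urlencode

ENDPOINT = "https://translation.googleapis.com/language/translate/v2"


def build_params(texts, target, source=None, text_format="html", model=None, key=None):
    """Validate and URL-encode translate parameters; q is repeated per input text."""
    if not texts:
        raise ValueError("q is required: pass at least one text")
    if not target:
        raise ValueError("target is required")
    if text_format not in ("html", "text"):
        raise ValueError("format must be 'html' or 'text'")
    if model not in (None, "base", "nmt"):
        raise ValueError("model must be 'base' or 'nmt'")

    params = [("q", text) for text in texts]
    params.append(("target", target))
    params.append(("format", text_format))
    # Optional parameters are only sent when supplied.
    if source:
        params.append(("source", source))
    if model:
        params.append(("model", model))
    if key:
        params.append(("key", key))
    return urlencode(params)


encoded = build_params(["Hello", "Good morning"], "de", source="en", text_format="text")
print(ENDPOINT)
print(encoded)
```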
Yandex does not officially provide alignment information either, but unofficially it does, even in the paid service. If you send "options=4" in the query string of the URL, it returns alignment information.

Monday, December 4, 2017

ATTACHING A LOCAL TERMBASE IN MULTERM



LOADING XDT FILES TO CREATE THE TERMBASE
Download the zipped file from the FTP site (or save the copy received via e-mail). Unzip the files (XDT and XML) of the termbase to your local drive.

In MultiTerm, go to Termbase > Create Termbase. Browse to the folder you want to save the termbase in, give it a name and click Save. MultiTerm will take you through the Termbase Wizard:

  • Step 1: Select Load an existing termbase definition file and click Browse. Navigate to the folder where you have saved the unzipped files, select the XDT file and click Open.
  • Step 2: Enter a descriptive name for the new termbase.
  • Steps 3 and 4: Do not change any settings for the index fields, descriptive fields and entry structure.
  • Click Finish to close the wizard.

You will now need to import all entries from the XML file.
IMPORTING XML FILES
MultiTerm iX / 7 / 2007
Go to Termbase > Import Entries. Select Default import definition and click Process. Click Browse and navigate to the folder where you have saved the unzipped files, select the XML file and click OK.

Make sure you select Fast import (import file is fully compliant with MultiTerm XML) in the import wizard and click Next until the import starts.
MultiTerm 2009
Select the Catalog view from the bottom left corner and right-click Import. Select Process. Click Browse and navigate to the folder where you have saved the unzipped files, select the XML file and click OK.

Make sure you select Fast import (import file is fully compliant with MultiTerm XML) in the import wizard and click Next until the import starts.