Tuesday, June 4, 2013

Regular expressions in memoQ

Regular expressions are a powerful means for finding character sequences in text. In memoQ, they are used to define segmentation rules and auto-translation rules.
Finding character sequences is a familiar task to everyone who has used a word processor or text editor before. The Find or Search dialog serves this purpose – if you search for ‘cat’, your editor will highlight words (or parts of words) such as ‘cat’, ‘cats’, or even ‘sophisticated’.
Regular expressions, however, provide a lot more freedom to tell the computer what you are looking for. You can identify sequences such as a letter ‘a’, followed by two or three letters ‘c’; a number of letters followed by one or more digits; or either of the words ‘cat’, ‘dog’ or ‘mouse’ – and much more. After reading through this page and experimenting with the examples, you’ll know exactly how.
Note: The term regular expression comes from the mathematical theory on which this pattern matching method is based. It is often abbreviated as regexp or regex – here we’ll use regex, or in the plural, regexes.
Literal and Meta
In a word processor’s old-school Find function, every character is interpreted literally. If you search for ‘Yes? No...’ it will highlight ‘Yes? No...’ – or nothing if these characters do not appear in the text. In a regex, however, some characters have a special meaning – these are called meta characters. The most important meta characters are:
Expression   Description
.            Matches any single character.
|            Matches either the expression on its left or the expression on its right. For example, ‘a|b’ matches ‘a’ and ‘b’.
[]           Matches any one of the enclosed characters. For example, ‘[ab]’ matches ‘a’ and ‘b’. ‘[0-9]’ matches any digit.
[^]          Matches any one character that is not among the enclosed characters. For example, ‘[^ab]’ matches all characters except ‘a’ and ‘b’. ‘[^0-9]’ matches any non-digit character.
*            The character to the left of the asterisk may occur 0 or more times. For example, ‘be*’ matches ‘b’, ‘be’ and ‘bee’.
+            The character to the left of the plus sign must occur 1 or more times. For example, ‘be+’ matches ‘be’ and ‘bee’ but not ‘b’.
?            The character to the left of the question mark may occur 0 times or 1 time. For example, ‘be?’ matches ‘b’ and ‘be’ but not ‘bee’.
{num}        The character to the left of the braces must occur exactly num times. For example, ‘be{2}’ matches ‘bee’ but not ‘be’.
()           Creates a group and ‘remembers’ the part of the string it matched. Groups can be used to re-order parts of a string, e.g. when converting dates to a different format.
\            Escape character. It removes the special meaning of the meta character that follows it. To match the character ‘\’ itself, use ‘\\’.
Confusing? This table is only meant as a short summary and reference – the meaning of all of these expressions will be clarified in the sections below.
For now, let’s focus on the first one, the dot. In a regex it means ‘any character may stand here’. So the expression ‘No...’ (‘No’ followed by three dots) in a regex will match any of the following:
· Notes
· Notte
· No...
· No&%X

So what do you need to write in a regex to match precisely ‘No...’ and no other text? To use a character that has a special meaning as an ordinary character, you must ‘escape’ it: that is, precede it with a backslash. Thus, ‘No\.\.\.’ will match exactly ‘No...’ and nothing else.
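The difference between a plain dot and an escaped dot can also be tried outside memoQ. The following is a minimal sketch using Python's `re` module as a stand-in; memoQ's own regex engine may differ from Python's in details, and the names `candidates`, `loose` and `strict` are made up for this illustration.

```python
import re

# Sample strings, including one ("Nope") that is too short to match.
candidates = ["Notes", "Notte", "No...", "No&%X", "Nope"]

# Unescaped dots: each '.' stands for any single character.
loose = re.compile(r"No...")
# Escaped dots: each '\.' stands for a literal full stop.
strict = re.compile(r"No\.\.\.")

# fullmatch() succeeds only if the whole string fits the pattern.
loose_hits = [c for c in candidates if loose.fullmatch(c)]
strict_hits = [c for c in candidates if strict.fullmatch(c)]

print(loose_hits)   # ['Notes', 'Notte', 'No...', 'No&%X']
print(strict_hits)  # ['No...']
```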
How to test regular expressions
In memoQ, regexes are used to define segmentation rules and auto-translation rules, but not to search for text. So how can you sharpen your skills? Here’s a trick to ‘abuse’ auto-translation rules to experiment with regular expressions. Create a test project, and in the Settings pane of Project home click the Auto-translation rules tab. In the dialog that appears, delete every rule already there, and enter a rule of your own. For that rule, also add a replace order rule so that you see the dialog fields filled as shown below. (What a replace order rule means and why you need it here will be explained below.)

[Screenshot: MemoQ regexp 1]

Now click Preview, type the text shown below in the Before auto translation box, and click the Preview button. You will see the following:

[Screenshot: MemoQ regexp 2]

The ‘x’ in the Replace order rules field tells memoQ to replace text which the specified regex matches with a letter ‘x’ – that’s how you know that your regex is working in this experiment. In the Auto translation preview dialog you can see exactly which parts of the text you provided are replaced by an ‘x’, allowing you to test your regex.
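If you do not have memoQ at hand, the same ‘replace every match with x’ experiment can be imitated in a few lines of Python. This is only a rough sketch: `preview` is a hypothetical helper written for this page, not a memoQ function, and Python's `re` module is assumed to behave like memoQ's engine for simple patterns such as this one.

```python
import re

def preview(regex: str, replacement: str, text: str) -> str:
    """Replace every match of regex in text with the replacement string."""
    return re.sub(regex, replacement, text)

# Only the literal 'No...' is replaced; 'Notes' is left alone.
result = preview(r"No\.\.\.", "x", "Yes? No... Notes")
print(result)  # Yes? x Notes
```

Wherever an ‘x’ appears in the output, the regex matched; everything else was left untouched.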
Character classes
Now that we’ve covered the dot and know how to experiment with new regexes, let’s move on to some more serious expressions. Brackets in regexes allow you to specify a set of characters, or a character class. ‘[ab][01]’ will match two-character-long sequences where the first character is either an ‘a’ or a ‘b’, and the second is either a ‘0’ or a ‘1’. This yields 4 possible matches: ‘a0’, ‘b0’, ‘a1’, ‘b1’.
Character classes can be used to express things like ‘a digit followed by a comma or an exclamation mark’ – which could be expressed as ‘[0123456789][,!]’. This, however, would be a very inconvenient thing to write. Regexes know better: you can specify a range of characters by writing ‘[0-9][,!]’, which is exactly the same as the previous expression.
Note: Can you use ranges to say ‘match an alphabetical letter’? Yes and no. A typical solution to do this used to be ‘[a-z]’, which matches any of the letters between a and z. Keep in mind, however, that memoQ works with many different languages which often have special characters in their alphabet. The Icelandic letter ‘ð’, for instance, is definitely not in the range a-z. Therefore memoQ uses a special extension to deal with alphabetical letters, which will be described below.
Also, keep in mind that all letters in memoQ regexes are interpreted in a case-sensitive way. Thus, ‘[a-z]’ will match ‘f’ but not ‘F’.
Besides specifying what you want to match, you can also use character classes to specify what not to match. The regex ‘[^0a].’ will match any two-character sequence, so long as the first character is not ‘0’ or ‘a’.
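The character classes discussed in this section can be checked with a short script. Again, this is a sketch in Python's `re` module rather than in memoQ itself, and the variable names and sample strings are invented for the example.

```python
import re

# [ab][01]: 'a' or 'b', followed by '0' or '1'.
two_char = re.compile(r"[ab][01]")
samples = ["a0", "b0", "a1", "b1", "c0", "a2"]
two_char_hits = [s for s in samples if two_char.fullmatch(s)]
print(two_char_hits)  # ['a0', 'b0', 'a1', 'b1']

# [0-9][,!]: a digit followed by a comma or an exclamation mark.
digit_punct_hits = re.findall(r"[0-9][,!]", "7, 8! 9? x,")
print(digit_punct_hits)  # ['7,', '8!']

# [^0a].: any two characters, as long as the first is not '0' or 'a'.
negated = re.compile(r"[^0a].")
negated_hits = [s for s in ["b1", "01", "ab", "Zz"] if negated.fullmatch(s)]
print(negated_hits)  # ['b1', 'Zz']

# Matching is case-sensitive: 'f' is in a-z, 'F' is not.
lower_f = re.fullmatch(r"[a-z]", "f") is not None
upper_f = re.fullmatch(r"[a-z]", "F") is not None
print(lower_f, upper_f)  # True False
```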
Escape sequences
As you saw above, you can restore the literal meaning of a meta character by preceding it with a backslash (‘\’), that is, by escaping it. There are also other practical escape sequences available. The most important ones for the way regexes are used in memoQ are:
Sequence   Description
\s         Whitespace: space, tab or newline
\S         Anything but whitespace
\t         Tab
\n         Newline
\d         Digit (between 0 and 9)
\D         Anything but digits
\w         Alphanumeric character or underscore
\W         Anything but alphanumeric characters and the underscore
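To get a feel for these escape sequences, here is a small sketch that applies a few of them to a sample string. It uses Python's `re` module as a stand-in for memoQ, and the sample text is invented; the exact set of characters covered by ‘\d’ or ‘\w’ can vary between regex engines.

```python
import re

# A sample with a space, a tab and a newline in it.
text = "Room 12\tfloor 3\nexit_b"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits and '_'
spaces = re.findall(r"\s", text)    # single whitespace characters

print(digits)  # ['12', '3']
print(words)   # ['Room', '12', 'floor', '3', 'exit_b']
print(spaces)  # [' ', '\t', ' ', '\n']
```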
Quantifiers
Now that you’ve learned to specify a set of alternative characters to match at a given position, it’s time to move down the road and tell memoQ how many characters to match. The special characters ‘*’, ‘+’ and ‘?’, and the expression {num} are used for this purpose.
· The regex ‘x+’ will match a sequence of characters which consists of one or more ‘x’s – thus, ‘x’, ‘xx’, ‘xxx’ and so on.
· The regex ‘x{3}’ will match a sequence of characters which consists of exactly 3 ‘x’s – thus, ‘xxx’, but not ‘x’ or ‘xx’. If the text is ‘xxxx’, the regex will match the first 3 ‘x’s and ignore the fourth: ‘[xxx]x’, with the match shown in brackets. For a parallel, remember that the traditional Find dialog will find the word ‘cat’ in ‘cats’.
· You can use the {num} quantifier in a special flavor by specifying a minimum or maximum value (or both). Thus, ‘x{3,5}’ will match between 3 and 5 ‘x’s; ‘x{3,}’ will match any sequence with at least 3 ‘x’s; and ‘x{0,5}’ will match any sequence with at most 5 ‘x’s. Not every regex engine accepts an omitted minimum such as ‘x{,5}’, so it is safer to write the zero explicitly.
· Perhaps the funniest of the quantifiers is the asterisk (‘*’). Its meaning is ‘match zero or more of the given character’. What on earth is that good for? Well, you can say things like “match the letter ‘T’ preceded by some ‘a’s – or maybe none”. The corresponding regex is ‘a*T’, which will match ‘T’, ‘aT’, ‘aaT’ and so on.
· A little less exciting but no less useful quantifier is the question mark. Its meaning is to match zero or one of the character in front of it. Thus, ‘ax?y’ will match ‘ay’ and ‘axy’, but not ‘axxy’.
 
If you think quantifiers are fun, it’s time to combine them with character sets. Just as after characters, you can write quantifiers after character sets. ‘[0-9]+%’ will match a sequence of digits followed by a percentage sign; for instance, ‘1%’ or ‘99%’, but not ‘10a%’.
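The quantifier examples from this section can be verified with the following sketch. It assumes Python's `re` module behaves like memoQ's engine for these simple patterns; `matches_fully` is a hypothetical helper that reports whether a regex covers an entire string.

```python
import re

def matches_fully(regex: str, text: str) -> bool:
    """Return True if the regex matches the whole of text."""
    return re.fullmatch(regex, text) is not None

# x+ : one or more 'x'; the empty string does not qualify.
one_or_more = [s for s in ["", "x", "xxx"] if matches_fully(r"x+", s)]
# x{3,5} : between 3 and 5 'x's; try lengths 1 to 7.
three_to_five = [n for n in range(1, 8) if matches_fully(r"x{3,5}", "x" * n)]
# x{3} inside 'xxxx' : the first three characters are matched.
first_three = re.search(r"x{3}", "xxxx").span()
# a*T : 'T' preceded by zero or more 'a's.
zero_or_more = [s for s in ["T", "aT", "aaT", "bT"] if matches_fully(r"a*T", s)]
# ax?y : at most one 'x' between 'a' and 'y'.
optional = [s for s in ["ay", "axy", "axxy"] if matches_fully(r"ax?y", s)]
# [0-9]+% : a quantifier applied to a character set.
percent = [s for s in ["1%", "99%", "10a%"] if matches_fully(r"[0-9]+%", s)]

print(one_or_more)    # ['x', 'xxx']
print(three_to_five)  # [3, 4, 5]
print(first_three)    # (0, 3)
print(zero_or_more)   # ['T', 'aT', 'aaT']
print(optional)       # ['ay', 'axy']
print(percent)        # ['1%', '99%']
```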
Groups and Alternatives
Having covered character sets and quantifiers, there are only two standard regex features left to explore: groups and alternatives.
Using the pipe (‘|’) symbol you can join several smaller regexes to say ‘match either this, that or the other thing’. The regex ‘EUR|USD|GBP’ will match any of these words, and only these.
When working with alternatives you usually need to group them together using parentheses to get the desired results. Let’s say you want a regex that matches any of these expressions: ‘EUR 15 million’, ‘USD 37 million’ and ‘GBP 5 million’. As a first try, you might be inclined to write ‘EUR|USD|GBP \d{1,} million’. This, however, will not do, as it only matches the following strings: ‘EUR’, ‘USD’ and ‘GBP [any natural number] million’. You need to group your alternatives together in the regex: ‘(EUR|USD|GBP) \d{1,} million’, where ‘EUR|USD|GBP’ can be either ‘EUR’ or ‘USD’ or ‘GBP’ and ‘\d{1,}’ can be any sequence of one or more digits.
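The effect of the missing parentheses is easy to see when both versions are run against the same text. The sketch below uses Python's `re` module as a stand-in for memoQ; the sample sentence and variable names are made up for the illustration.

```python
import re

text = "EUR 15 million, USD 37 million and GBP 5 million"

# Without parentheses the pipe splits the WHOLE regex into three alternatives:
# 'EUR', 'USD' and 'GBP \d{1,} million'.
ungrouped = re.compile(r"EUR|USD|GBP \d{1,} million")
# With parentheses only the currency code is an alternative.
grouped = re.compile(r"(EUR|USD|GBP) \d{1,} million")

# group(0) is the full text of each match.
ungrouped_hits = [m.group(0) for m in ungrouped.finditer(text)]
grouped_hits = [m.group(0) for m in grouped.finditer(text)]

print(ungrouped_hits)  # ['EUR', 'USD', 'GBP 5 million']
print(grouped_hits)    # ['EUR 15 million', 'USD 37 million', 'GBP 5 million']
```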
Replacing and reordering
For the purposes of segmentation, memoQ only uses regexes to match patterns in the translation document’s text. For auto-translation rules it also makes use of another powerful regex feature that has to do with groups: replacing and reordering parts of the matched text.
· Replacing a matched text with a single string:
 
You already saw a possible use for replacement in the How to test area of this page. There we defined the rather simplistic Replace order rule of ‘x’ to replace a regex match with the letter ‘x’ for the purposes of testing.
· Reordering and/or replacing parts of a matched text:
 
Here you need to enclose in a pair of parentheses every part of the regex that you want to reference. The match enclosed in every pair of parentheses is remembered by memoQ and assigned a number starting with 1. When writing the replace order rule you can reference these remembered substrings by ‘$1’, ‘$2’ etc., in the order in which the opening parentheses appear in the regex.
Using the previous regex example, you also have to put ‘\d{1,}’ in parentheses to make reordering of these currencies and their values possible: ‘(EUR|USD|GBP) (\d{1,}) million’. In the replace order rule you can reference ‘EUR|USD|GBP’ by ‘$1’, and ‘\d{1,}’ by ‘$2’. So if you want to change their order, the replace order rule could be ‘$2 Millionen $1’.
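The reordering can be imitated in Python as a rough sketch. One difference to keep in mind: Python writes group references in the replacement as ‘\1’ and ‘\2’, which play the same role as ‘$1’ and ‘$2’ in a memoQ replace order rule. The sample sentence is invented.

```python
import re

# Group 1 is the currency code, group 2 is the number.
pattern = re.compile(r"(EUR|USD|GBP) (\d{1,}) million")

# '\2 Millionen \1' puts the number first and the currency last.
reordered = pattern.sub(r"\2 Millionen \1", "EUR 15 million and USD 37 million")

print(reordered)  # 15 Millionen EUR and 37 Millionen USD
```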
memoQ extensions
For the purposes of segmentation and defining auto-translation rules it is often useful to work with lists of words – abbreviations, the names of months, currencies etc. In theory it would be possible to list these words grouped together as alternatives in the regular expressions, as you saw in the preceding section. However, doing so would result in very complicated and hard-to-maintain regexes. memoQ therefore introduces a special extension to regular expressions: custom lists.
Lists of words used in regular expressions can be defined in the Custom lists tab of the segmentation rules dialogs or of the auto-translation rules dialogs, or in the Translation pairs tab of the auto-translatables dialogs.
· The custom lists in the Custom lists tab of the segmentation rules dialogs should contain characters and abbreviations that are important for segmentation (e.g. ‘.’, ‘!’, ‘e.g.’).
· The custom lists in the Custom lists tab of the auto-translatables dialogs should contain words that have the same source and target form (e.g. ‘€’, ‘$’).
· The custom lists in the Translation pairs tab of the auto-translatables dialogs should contain source words with their target equivalents (e.g. In English-German projects ‘January’ should be translated as ‘Januar’, ‘February’ as ‘Februar’ etc.).
 
The name of a custom list must always start and end with a hash mark (‘#’). The words that make up a custom list are always interpreted as plain text, i.e. no characters are treated as meta characters with a special meaning.
Note: For segmentation rules memoQ defines one more special item: ‘#!#’. This extension does not influence regex matching in any way. Instead, it tells memoQ to introduce a segment break at the given location if the expression matches text in the imported document.
Example of using custom lists from the Custom lists tab of the auto-translation rules dialogs:
Suppose you want memoQ to offer you ‘15 Millionen EUR’ in the Translation results pane for every occurrence of ‘EUR 15 million’ and ‘37 Millionen USD’ for ‘USD 37 million’. Create a custom list labeled ‘#currency#’ in the Custom lists tab containing ‘EUR’, ‘USD’ and ‘GBP’.

[Screenshot: MemoQ regexp 3]

Now create the following regex: ‘(#currency#) (\d{1,}) million’ (equivalent to ‘(EUR|USD|GBP) (\d{1,}) million’), for which the replace order rule could be ‘$2 Millionen $1’. The preview of the above regex and replace order rule will yield the following result:

[Screenshot: MemoQ regexp 4]
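One way to picture what a custom list does is to expand it into a group of alternatives by hand. The sketch below does this in Python; how memoQ actually expands ‘#currency#’ internally is not documented here, so treat the expansion step as an assumption. `re.escape` is used because the words of a custom list are interpreted as plain text.

```python
import re

# Hypothetical stand-in for the '#currency#' custom list.
currency = ["EUR", "USD", "GBP"]

# Escape each word so it is taken literally, then join with the pipe.
alternation = "|".join(re.escape(word) for word in currency)

# Substitute the list name in the rule with the alternation.
rule = r"(#currency#) (\d{1,}) million".replace("#currency#", alternation)
print(rule)  # (EUR|USD|GBP) (\d{1,}) million

# Python's \2 and \1 correspond to $2 and $1 in the replace order rule.
offered = re.sub(rule, r"\2 Millionen \1", "EUR 15 million; USD 37 million")
print(offered)  # 15 Millionen EUR; 37 Millionen USD
```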

Now suppose you want memoQ to offer you ‘15 Millionen Euro’ in the Translation results pane for every occurrence of ‘EUR 15 million’ and ‘37 Millionen Dollar’ for ‘USD 37 million’. Create a custom list labeled ‘#currency#’ in the Translation pairs tab containing the following translation pairs: ‘EUR’ – ‘Euro’, ‘USD’ – ‘Dollar’ and ‘GBP’ – ‘Pfund’.

[Screenshot: MemoQ regexp 5]

Now create the following regex: ‘(#currency#) (\d{1,}) million’, for which the replace order rule could be ‘$2 Millionen $1’. The preview of the above regex and replace order rule will yield the following result:

[Screenshot: MemoQ regexp 6]
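The translation-pairs variant can be sketched in the same way: the matched source word is looked up and replaced by its target equivalent while the groups are reordered. This is an illustration in Python under the assumption that a translation-pairs list behaves like a lookup table; the names `pairs` and `translate` are hypothetical and not part of memoQ.

```python
import re

# Hypothetical stand-in for the '#currency#' translation pairs.
pairs = {"EUR": "Euro", "USD": "Dollar", "GBP": "Pfund"}

# Build the alternation from the source words, escaped as plain text.
alternation = "|".join(re.escape(source) for source in pairs)
pattern = re.compile("(" + alternation + r") (\d{1,}) million")

def translate(match: re.Match) -> str:
    """Emit '<number> Millionen <target word>' for one match."""
    return f"{match.group(2)} Millionen {pairs[match.group(1)]}"

offered = pattern.sub(translate, "EUR 15 million; USD 37 million")
print(offered)  # 15 Millionen Euro; 37 Millionen Dollar
```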
Source: http://memoq.helpmax.net