Tuesday, September 20, 2011

Guide to regular expressions with examples

Regular expressions, or regexps, are among the most powerful, versatile and hated tools used by programmers and system administrators.
They let you describe, in a few characters, a search for strings, characters or words. A well-written expression gives good results, but a wrong one gives nothing useful, and the worst part is that it is often hard to tell whether a regexp is written correctly and covers every possible case.
First, let’s see what a regular expression is:
From Wikipedia
In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.



Syntax: is it the same for all programs and languages?

Mostly, yes. JavaScript and Perl have similar syntax, preg_replace in PHP uses the same syntax as Perl, and ereg … well, ereg is now deprecated.
IDEs and text editors such as vim, Notepad++, Komodo Edit, Dreamweaver, etc. also support search & replace with regular expressions.
In general the syntax may change, but not by much. If you learn the Perl syntax, you can get by with any other variant.

Metacharacters

In regular expressions there are several “special characters” with different functions:
. (dot) Matches any character except a newline (\n by default)
Example
$text = "espressioni regolari!";
preg_match_all('/./', $text, $ris);
// Will match all characters
^ Identifies the beginning of a line; placed at the beginning of a character class ([]) it negates the class itself
Example
$text = "espressioni regolari!";
preg_match_all('/^./', $text, $ris);
// It will find only the character "e"
$ identifies the end of a line
Example
$text = "espressioni regolari!";
preg_match_all('/.$/', $text, $ris);
// It will find only "!"
| Is an OR condition
Example
$text = "espressioni regolari!";
preg_match_all('/a|i|u|o|e/', $text, $ris);
// It will find all the vowels
() Parentheses delimit groups of characters
[] Brackets delimit ranges and character classes
\ Cancels the special meaning of the metacharacter that follows it
Example
$text = "espressioni.regolari!";
preg_match_all('/\./', $text, $ris);
// It will find only the . (dot)
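Parentheses and brackets have no example in the list above, so here is a minimal sketch using the same sample text. The patterns are chosen only for illustration; the second one also shows that preg_match_all stores the text captured by the group in $ris2[1].

```php
<?php
$text = "espressioni regolari!";

// Brackets: match any single character listed in the class
preg_match_all('/[rg]/', $text, $ris1);
// $ris1[0] contains "r", "r", "g", "r"

// Parentheses: the OR applies only inside the group,
// then \w* takes the rest of the word
preg_match_all('/(esp|reg)\w*/', $text, $ris2);
// $ris2[0] contains "espressioni" and "regolari"
// $ris2[1] contains the text captured by the group: "esp" and "reg"
```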

Quantifiers
Quantifiers, as the name suggests, indicate how many times a given sequence of characters must occur.

* (star) indicates 0 or more occurrences
Example
$testo = "Espressioni, pesi, piume!";
preg_match_all('/s*i/', $testo, $ris);
// Will match "ssi" and the final "i" in Espressioni,
// "si" in pesi
// and "i" in piume
+ indicates 1 or more occurrences
Example
$testo = "Espressioni, pesi, piume!";
preg_match_all('/s+i/', $testo, $ris);
//  Will match "ssi" of espressioni,
// and "si" in pesi
? indicates 0 or 1 occurrences
Example
$testo = "Espressioni, pesi, piume!";
preg_match_all('/s?i/', $testo, $ris);
// Will match "si" and "i" in Espressioni,
// "si" in pesi
// and "i" in piume
{n} Matches exactly n occurrences. Remember that curly brackets are treated as normal characters in all other contexts
Example with a replace
$testo = "ese, esse, essse, esssse!";
$testo = preg_replace('/es{2}e/', '*', $testo);
// Now $testo will be "ese, *, essse, esssse!"
{n,} Matches at least n occurrences, see above
Example with a replace
$testo = "ese, esse, essse, esssse!";
$testo = preg_replace('/es{3,}e/', '*', $testo);
// Now $testo will be "ese, esse, *, *!"
{n,m} Matches at least n occurrences, but no more than m, see above
Example with a replace
$testo = "ese, esse, essse, esssse!";
$testo = preg_replace('/es{2,3}e/', '*', $testo);
// Now $testo will be "ese, *, *, esssse!"

Ungreedy quantifiers

Almost everyone stumbles into this problem sooner or later: if I use an expression such as /".*"/, will I find all the words enclosed in double quotes? Unfortunately, no!
This is because the standard quantifiers are “greedy”, that is, they look for the longest possible match.
Let’s look at an example:
Example
$testo = 'class="pluto" id="pippo"';
preg_match_all('/".*"/', $testo, $ris);
// Will find a single occurrence:
// "pluto" id="pippo"
As you can see, this is not the desired result! How do we fix it?
Just add a question mark after the quantifier
Example
$testo = 'class="pluto" id="pippo"';
preg_match_all('/".*?"/', $testo, $ris);
// Now it will find "pluto" and "pippo"!
This applies to every quantifier described above!
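To show that the question mark works with other quantifiers too, here is a small sketch with + and {n,m}. The sample strings are made up for this example.

```php
<?php
$html = '<b>uno</b> <b>due</b>';

// Greedy +: takes as much as possible, so there is a single match
preg_match_all('/<b>.+<\/b>/', $html, $greedy);
// $greedy[0] contains only "<b>uno</b> <b>due</b>"

// Ungreedy +?: stops at the first closing tag
preg_match_all('/<b>.+?<\/b>/', $html, $lazy);
// $lazy[0] contains "<b>uno</b>" and "<b>due</b>"

// The same works with {n,m}
preg_match('/s{2,3}/', 'essse', $ris1);   // $ris1[0] is "sss"
preg_match('/s{2,3}?/', 'essse', $ris2);  // $ris2[0] is "ss"
```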

Classes and ranges

Classes define a list of characters, ranges or POSIX classes (see next section) to be searched. They are enclosed in square brackets and can be followed by quantifiers.
Example
$testo = 'Questa è una stringa lunga lunga di esempio';
preg_match_all('/[aiuoe]{2}/', $testo, $ris);
// The expression will search for two successive vowels,
// so it will find "ue" and "io"
To identify a range use the minus sign (-). For example, a-z identifies all lowercase characters from a through z, F-R the uppercase characters from F to R, 0-5 the digits from 0 to 5, and so on.
Example
$testo = 'caratteri 16sdf456 e un colore esadecimale 94fa3c ';
preg_match_all('/[0-9a-f]{6}/', $testo, $ris);
// The expression will search for 6 characters that are numbers or letters from a to f
// so it will find "94fa3c"
The ^, if placed immediately after the opening bracket, negates the whole class, meaning that the listed characters must not be matched.
Example
$testo = 'Questa è una stringa lunga lunga di esempio';
preg_match_all('/[^aiuoe ]{3}/', $testo, $ris);
// The expression will search for 3 consecutive characters that are NOT vowels or spaces
// so it will find only "str"

Character classes and POSIX

Character classes and POSIX classes are used to specify a whole set of characters at once, without listing them one by one.
Class : \w
Matches : [a-zA-Z0-9_]
Description: Matches a “word” character (w stands for word), i.e. letters, digits and “_”
Example:
$testo = "[[Le_Regex sono_belle!!!]]";
preg_match_all('/\w+/', $testo, $ris);
// Will match "Le_Regex" and  "sono_belle"
Class: \d
Matches : [0-9]
Description: Matches a digit (d stands for digit)
Example:
$testo = "123 stella! 456 cometa!";
preg_match_all('/\d+/', $testo, $ris);
// Will match "123" and "456"
Class: \s
Matches: [ \t\r\n\v\f]
Description: Matches whitespace, including tabs and newlines
Example:
$testo = "manuale sulle
          espressioni regolari!";
$testo = preg_replace('/\s+/', '', $testo);
// Now $testo will be "manualesulleespressioniregolari!"
These 3 classes mean the opposite if you use the same letter capitalized.
Thus, for example, \D matches anything that is not a digit.
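A quick sketch of the three capitalized classes, reusing the sample text from the \d example:

```php
<?php
$testo = "123 stella! 456 cometa!";

// \D: everything that is not a digit
preg_match_all('/\D+/', $testo, $non_cifre);
// $non_cifre[0] contains " stella! " and " cometa!"

// \S: everything that is not whitespace
preg_match_all('/\S+/', $testo, $non_spazi);
// $non_spazi[0] contains "123", "stella!", "456", "cometa!"

// \W: everything that is not a "word" character
preg_match_all('/\W+/', $testo, $non_parole);
// $non_parole[0] contains " ", "! ", " ", "!"
```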
Class: [:alnum:]
Matches: [a-zA-Z0-9]
Description: Matches alphanumeric characters, without “_”
Example
$testo = "[[Le_Regex 123 sono_belle!!!]]";
preg_match_all('/[[:alnum:]]+/', $testo, $ris);
// Will match "Le","Regex", "123",
// "sono" and "belle"
Class: [:alpha:]
Matches: [a-zA-Z]
Description: Matches alphabetic characters
Example
$testo = "[[Le_Regex 123 sono_belle!!!]]";
preg_match_all('/[[:alpha:]]+/', $testo, $ris);
// Will match "Le", "Regex", "sono" and "belle"
Class: [:blank:]
Matches: [ \t]
Description: Matches only spaces and tabs
Example
$testo = "questa è una prova
	con spazi e tabulazioni";
$testo = preg_replace('/[[:blank:]]+/', '', $testo);
/* $testo will now be:
questaèunaprova
conspazietabulazioni
*/
Class: [:upper:]
Matches: [A-Z]
Description: Matches uppercase letters
Example
$testo = "ESPRESSIONI regolari";
preg_match_all('/[[:upper:]]+/', $testo, $ris);
// Will match "ESPRESSIONI"
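Since POSIX classes live inside square brackets, several of them can share the same brackets. A small sketch, assuming the [:digit:] class (not listed above, but part of the same POSIX family):

```php
<?php
$testo = "[[Le_Regex 123 sono_belle!!!]]";

// Uppercase letters OR digits, in a single class
preg_match_all('/[[:upper:][:digit:]]+/', $testo, $ris);
// $ris[0] contains "L", "R" and "123"
```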

Modifiers

Each search can use several modifiers which, as the name implies, change the default search behavior.
These modifiers are placed at the end of the pattern, immediately after the closing delimiter.
You can combine several modifiers by appending them without spaces (for example, /imsu applies all 4 of the corresponding effects described below).
i The search becomes case-insensitive, i.e. upper and lower case are considered equal
$testo = "Le Espressioni Regolari sono regolari?";
preg_match_all('/regolari/i', $testo, $ris);
// Will match both "regolari" and "Regolari"
m The search is performed line by line, i.e. the anchors “^” and “$” apply to each line of the text
$testo = 'Espressioni Regolari
Espressioni in perl
Espressioni php';
preg_match_all('/^Espressioni/m', $testo, $ris);
// will match all 3 "Espressioni"
// and not just the first
s The text is treated as a single line and “.” also matches newline characters, which it normally would not
$testo = 'Espressioni Regolari
Espressioni in perl
Espressioni php';
preg_match('/perl.Espressioni/s', $testo);
// The search will succeed
u Enables full Unicode support, with code points up to \x{10FFFF}
$testo = '紫の触手、緑の触手';
preg_match('/\x{89e6}\x{624b}/u', $testo, $ris);
// The search will succeed
U Makes all quantifiers ungreedy
$testo = 'class="pluto" id="pippo"';
preg_match_all('/".*"/U', $testo, $ris);
// Same as /".*?"/: it will match both "pluto" and "pippo"
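As mentioned at the start of this section, modifiers can be combined. A minimal sketch with i and m together, on a made-up three-line text:

```php
<?php
$testo = "Espressioni Regolari\nespressioni in perl\nESPRESSIONI php";

// Only i: case-insensitive, but ^ matches only at the start of the string
preg_match_all('/^espressioni/i', $testo, $solo_i);
// $solo_i[0] contains only "Espressioni"

// i and m together: case-insensitive AND ^ matches at the start of each line
preg_match_all('/^espressioni/im', $testo, $i_e_m);
// $i_e_m[0] contains "Espressioni", "espressioni" and "ESPRESSIONI"
```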

Anchors
Anchors identify the position in the text where the match must occur.

^ Identifies the beginning of the string; with the modifier /m it identifies the beginning of each line
$testo = 'Questo è un esempio
sulle Espressioni Regolari
nella sintassi di perl';
preg_match_all('/^[\w]+/m', $testo, $ris);
// Will match "Questo", "sulle" and "nella"
$ Identifies the end of the string; with the modifier /m it identifies the end of each line
$testo = 'Questo è un esempio
sulle Espressioni Regolari
nella sintassi di perl';
preg_match_all('/[\w]+$/m', $testo, $ris);
// Will match
// "esempio", "Regolari", "perl"
\A Similar to ^, but identifies only the beginning of the string, even with the modifier /m
$testo = 'Questo è un esempio
sulle Espressioni Regolari
nella sintassi di perl';
preg_match_all('/\A[\w]+/m', $testo, $ris);
// will match only "Questo"
\Z Similar to $, but identifies only the end of the string, even with the modifier /m
$testo = 'Questo è un esempio
sulle Espressioni Regolari
nella sintassi di perl';
preg_match_all('/[\w]+\Z/m', $testo, $ris);
// Will match only "perl"
\b Identifies a word boundary: the position between a \w character and a non-\w character (or the start or end of the string), in either order
$testo = 'condor daino dingo elefante';
preg_match_all('/\bd\w+/', $testo, $ris);
// The search will find only the words that begin
// with the letter d, so "daino" and "dingo"
\B Identifies the opposite of \b: any position that is not a word boundary
$testo = 'condor daino dingo elefante';
preg_match_all('/\Bd\w+/', $testo, $ris);
// The search will find only a set of characters
// beginning with d, which is not the beginning of
// a word, in this case "dor"
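Anchors are also what you need when the whole string must match, not just a part of it. Here is a minimal sketch that combines \A and \Z with the hexadecimal class seen earlier; the function name is_hex_color is hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helper, used only to illustrate the anchors
function is_hex_color($testo) {
    // \A and \Z pin the match to the beginning and the end of the string
    // (note that \Z still accepts one trailing newline)
    return preg_match('/\A[0-9a-f]{6}\Z/i', $testo) === 1;
}

var_dump(is_hex_color('94fa3c'));        // bool(true)
var_dump(is_hex_color('colore 94fa3c')); // bool(false): extra text before
var_dump(is_hex_color('94fa3cz'));       // bool(false): extra text after
```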

The special characters

If you have some knowledge of PHP or Perl, you will already have dealt with the placeholders that identify special characters such as newlines or tabs. All of these placeholders begin with the character \
Here is a detailed list, with particular emphasis on the importance of \Q!
\t tab (HT, TAB)
\n newline (LF, NL)
\r carriage return (CR)
\Q Disables every metacharacter up to \E. It is very useful for inserting variables into the pattern
$variabile = '[\w\s]+';
$testo = "Questo testo [\w\s]+[\w\s]+ ha regex al suo interno!";

$testo1 = preg_replace('/'.$variabile.'/', '', $testo);
// $testo1 will be "[\\]+[\\]+!"

$testo2 = preg_replace('/\Q'.$variabile.'\E/', '', $testo);
// $testo2 will be "Questo testo  ha regex al suo interno!"
\E Ends the sequence started by \Q, see above
\nnn Character in octal form, where n is a digit from 0 to 7
\xnn Character in hexadecimal form, where n is a hexadecimal …
\f form feed (FF)
\a alarm / bell (BEL)
\e escape (ESC)
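In PHP specifically, an alternative to wrapping a variable in \Q and \E is the built-in preg_quote function, which escapes the metacharacters for you. This is a sketch of that alternative, not something the \Q example above depends on; the second argument tells preg_quote to escape the delimiter too.

```php
<?php
$variabile = '[\w\s]+';
$testo = 'Questo testo [\w\s]+[\w\s]+ ha regex al suo interno!';

// preg_quote escapes every metacharacter in $variabile,
// plus the "/" delimiter passed as second argument
$pattern = '/' . preg_quote($variabile, '/') . '/';

$risultato = preg_replace($pattern, '', $testo);
// $risultato will be "Questo testo  ha regex al suo interno!"
```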

Groups

Groups are enclosed in parentheses and become essential when replacing, because you can refer back to them. An example to clarify everything:
$testo = "This is a date in mysql format: 2010-01-28";
$testo = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', 'This is a date in European format: $3/$2/$1', $testo);
// Now $testo will be "This is a date in European format: 28/01/2010"
As you can see, the expression has three groups, and the replacement contains dollar signs followed by a number: each number stands for the text matched by the corresponding group. So the first group is $1, the second $2, and so on.
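Groups are useful outside of replacements as well: preg_match stores the text of each group in its third argument, using the same numbering. A minimal sketch with the same date:

```php
<?php
$testo = "This is a date in mysql format: 2010-01-28";

if (preg_match('/(\d{4})-(\d{2})-(\d{2})/', $testo, $ris)) {
    // $ris[0] is the whole match, $ris[1]..$ris[3] are the groups
    $europea = $ris[3] . '/' . $ris[2] . '/' . $ris[1];
    echo $europea; // prints 28/01/2010
}
```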
Inside groups you can also use a logical “OR”, i.e. match one set of characters or another
$testo = "Si dice ha piovuto o è piovuto in italiano?";
$testo = preg_replace('/\s+((ha|è)\s+piovuto)\s+/', ' è nevicato ', $testo);
// This will search for "ha piovuto" OR "è piovuto" preceded and followed by spaces;
// $testo will now be "Si dice è nevicato o è nevicato in italiano?"

Conclusions

This concludes the introduction to regular expressions. I suggest using software such as JRegex, regexxer or KRegexpEditor to get some help; a cheat sheet can also be really useful. Source: http://linuxaria.com