13.2 Regular expressions to find more flexible patterns

Special characters used for pattern recognition:

$ | Match the pattern at the end of the string |
^ | Match the pattern at the beginning of the string |
{n} | The preceding pattern must occur exactly n times |
{n,m} | The preceding pattern must occur between n and m times |
+ | The preceding pattern must occur at least once |
* | The preceding pattern may occur zero or more times |
? | The preceding pattern may occur zero or one time |
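
The quantifiers above can be compared side by side on a small vector. This is only a sketch: the strings in x are made up for illustration, and each pattern is anchored with ^ and $ so that the whole element has to match.

```r
# Hypothetical test strings, chosen only to illustrate the quantifiers
x <- c("ct", "cat", "caat", "caaat")

# {n}: exactly 2 "a" between "c" and "t"
exactly_two <- grep(pattern="^ca{2}t$", x=x, value=TRUE)

# {n,m}: between 1 and 2 "a"
one_to_two <- grep(pattern="^ca{1,2}t$", x=x, value=TRUE)

# +: at least 1 "a"
one_or_more <- grep(pattern="^ca+t$", x=x, value=TRUE)

# *: 0 or more "a" (so "ct" also matches)
zero_or_more <- grep(pattern="^ca*t$", x=x, value=TRUE)

# ?: 0 or 1 "a"
zero_or_one <- grep(pattern="^ca?t$", x=x, value=TRUE)
```

Printing each object shows which elements survive: for instance, exactly_two keeps only "caat", while zero_or_more keeps all four elements.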

Define your own set of characters to match inside []:

\[abc\]: matches a single a, b, or c.
^\[abc\]: matches a, b, or c at the beginning of the element.
^A\[abc\]+: matches A as the first character of the element, followed by one or more of a, b, or c.
^A\[abc\]*: matches A as the first character of the element, followed by zero or more of a, b, or c.
^A\[abc\]{1}_: matches A as the first character of the element, followed by exactly one a, b, or c, then an underscore.
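
The three anchored patterns above differ only in their quantifier, which a short test makes visible. The identifiers in ids are hypothetical and exist only for this sketch.

```r
# Hypothetical identifiers, made up for this illustration
ids <- c("Aa_1", "Ab", "Aab_2", "Ba_3", "A_4")

# A first, then one or more of a, b, c
plus_match <- grep(pattern="^A[abc]+", x=ids, value=TRUE)

# A first, then zero or more of a, b, c (every element starting with A)
star_match <- grep(pattern="^A[abc]*", x=ids, value=TRUE)

# A first, then exactly one of a, b, c, then an underscore
one_match <- grep(pattern="^A[abc]{1}_", x=ids, value=TRUE)
```

Here plus_match drops "A_4" because nothing from the set follows the A, star_match keeps it, and one_match keeps only "Aa_1".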

\[a-z\]: matches any single lowercase letter from a to z.
\[A-Z\]: matches any single uppercase letter from A to Z.
\[0-9\]: matches any single digit from 0 to 9.
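
Ranges combine with the anchors and quantifiers seen earlier. The sketch below uses a made-up vector x. Note that R's documentation describes letter ranges such as a-z as locale-dependent, so the letter patterns may behave differently on unusual locale settings.

```r
# Hypothetical strings, made up for this illustration
x <- c("gene", "GENE", "2024", "Gene7")

# Elements made only of lowercase letters
only_lower <- grep(pattern="^[a-z]+$", x=x, value=TRUE)

# Elements made only of uppercase letters
only_upper <- grep(pattern="^[A-Z]+$", x=x, value=TRUE)

# Elements made only of digits
only_digits <- grep(pattern="^[0-9]+$", x=x, value=TRUE)

# Elements containing at least one digit anywhere
with_digit <- grep(pattern="[0-9]", x=x, value=TRUE)
```

Without the ^ and $ anchors, a range only has to match somewhere in the element, which is why with_digit also keeps "Gene7".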

Match any of the characters listed between the brackets (here g or t) at least once, anywhere in the element:

grep(pattern="[gt]+", 
    x=c("genomics", "proteomics", "transcriptomics"), 
    value=TRUE)

## [1] "genomics"        "proteomics"      "transcriptomics"

Match g or t at least once AND at the start of the element:

grep(pattern="^[gt]+",
    x=c("genomics", "proteomics", "transcriptomics"),
    value=TRUE)

## [1] "genomics"        "transcriptomics"

Create a vector of email addresses:

vec_ad <- c("marie.curie@yahoo.es", "albert.einstein01@hotmail.com", 
    "charles.darwin1809@gmail.com", "rosalind.franklin@aol.it")

Keep only the email addresses ending in “es”:

grep(pattern="es$",
    x=vec_ad,
    value=TRUE)

## [1] "marie.curie@yahoo.es"
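
The same vector can be filtered with ranges and quantifiers. This is a sketch that only reuses the addresses defined above; the object names with_digits, four_digits and ends_com are introduced here for illustration.

```r
# Same addresses as above, repeated so the example runs on its own
vec_ad <- c("marie.curie@yahoo.es", "albert.einstein01@hotmail.com",
    "charles.darwin1809@gmail.com", "rosalind.franklin@aol.it")

# At least one digit immediately before the "@"
with_digits <- grep(pattern="[0-9]+@", x=vec_ad, value=TRUE)

# Four digits immediately before the "@"
four_digits <- grep(pattern="[0-9]{4}@", x=vec_ad, value=TRUE)

# Addresses ending in "com"
ends_com <- grep(pattern="com$", x=vec_ad, value=TRUE)
```

with_digits and ends_com both keep the Einstein and Darwin addresses, while four_digits keeps only the Darwin address.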