Articles

Showing posts from August 2018

On the use of the OR operator | in regular expressions

I had some difficulty using the OR operator | in regular expressions. Let's do some exercises and examine the results:

gregexpr("bc", "abcd") returns 2. As expected, bc starts at the second position.

gregexpr("b|c", "abcd") returns 2 3. | is the OR operator, so this returns the positions of b OR c. On this string it gives the same result as gregexpr("[bc]", "abcd") or gregexpr("[b|c]", "abcd").

But | and [] are not always the same. Inside brackets, | is a literal character, so [b|c] would also match a | in the text.

Let's try something more complicated: gregexpr("ab|cd", "abcd"). My problem was with the precedence of the | operator. The result here is 1 3, which means the search was for (a followed by b) OR (c followed by d), and not for a, then b OR c, then d.

To search for a, then b OR c, then d: gregexpr("a[bc]d", "abcd") returns -1. It searches for abd or acd, and neither exists in abcd.

In practice, it is good to remember that | separates whole blocks of the pattern.
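To make the precedence of | easier to see, here is a small sketch in base R. The helper match_pos is a hypothetical name introduced only for this illustration (it is not part of the original post); it simply strips the attributes from the gregexpr() result so that only the start positions remain. The test strings "a|bcd" and "abd acd" are also additions chosen for the demonstration.

```r
# Hypothetical helper (not from the original post): return only the
# start positions of all matches, or -1 when there is no match.
match_pos <- function(pattern, text) {
  as.integer(gregexpr(pattern, text)[[1]])
}

# | separates whole alternatives: "ab" OR "cd"
match_pos("ab|cd", "abcd")      # 1 3

# Parentheses (or brackets) limit the alternation to a single position:
# "a", then "b" or "c", then "d"
match_pos("a(b|c)d", "abcd")    # -1
match_pos("a[bc]d", "abcd")     # -1
match_pos("a(b|c)d", "abd acd") # 1 5

# Inside brackets, | is a literal character
match_pos("b|c", "a|bcd")       # 3 4
match_pos("[b|c]", "a|bcd")     # 2 3 4
```

Wrapping the alternation in parentheses is the usual way to say "b or c at this position" when the alternatives are longer than one character, where brackets no longer work.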

[0-9] or \d or [:digit:] in grep

If you want to search for a digit using grep, you have 4 options: [0123456789], [0-9], \d or [:digit:]. Note that [:digit:] must itself be written inside brackets, as [[:digit:]], and that the backslash of \d must be doubled in an R string.

Here is a comparison:

    library(microbenchmark)
    microbenchmark({grep(pattern="[0123456789][0123456789][0123456789][0123456789]", "Ussiosuus8980JJDUD98")},
                   {grep(pattern="[0-9][0-9][0-9][0-9]", "Ussiosuus8980JJDUD98")},
                   {grep(pattern="[[:digit:]][[:digit:]][[:digit:]][[:digit:]]", "Ussiosuus8980JJDUD98")},
                   {grep(pattern="\\d\\d\\d\\d", "Ussiosuus8980JJDUD98")},
                   times = 1000L)

The results, one row per pattern, in the same order as in the call:

    pattern           min       lq       mean   median       uq     max  neval  cld
    [0123456789]   21.729  22.1840  22.887585  22.3090  22.4645  64.542   1000    d
    [0-9]           5.219   5.3740   5.696937   5.4665   5.8495  47.618   1000  a
    [[:digit:]]     9.778  10.0425  10.482301  10.1470  10.5400  25.602   1000   b
    \\d            10.296  10.5000  11.099359  10.6360  11.0440  39.759   1000    c

Clearly, [0-9] is the fastest solution in this comparison. This is the same for l
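As a quick check that the four notations really describe the same match, the sketch below extracts the first run of four digits with each of them, using only base R. The {4} quantifier and the vector names are choices made for this illustration and do not come from the benchmark above; it says nothing about speed.

```r
x <- "Ussiosuus8980JJDUD98"

# Four ways to write "exactly four digits in a row"
patterns <- c(
  explicit  = "[0123456789]{4}",
  range     = "[0-9]{4}",
  posix     = "[[:digit:]]{4}",
  shorthand = "\\d{4}"
)

# Extract the first match of each pattern from x
hits <- vapply(
  patterns,
  function(p) regmatches(x, regexpr(p, x)),
  character(1)
)

print(hits)
# Every pattern extracts "8980"
```

Since the patterns are equivalent on this input, the choice between them comes down to readability and the timing differences measured with microbenchmark.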