Basic Regular Expressions

#!/usr/bin/perl -w

#######################  Perl Regular Expressions #######################

#   Expression           Meaning
#   ----------           -------

#      \d                One digit character.
#      \D                One non-digit character.

#      \s                One whitespace character.  A whitespace
#                        character is a newline (\n), carriage
#                        return (\r), space, tab (\t) or formfeed (\f).

#      \S                One non-whitespace character.

#      \w                One "word" character (digit, letter or _).
#      \W                One non-word character.
 
#      *                 Longest string of zero or more of the
#                        preceding character or character group.

#      +                 Longest string of ONE or MORE of the
#                        preceding character or character group.

#      ?                 Zero or one only of the preceding character
#                        or character group.

#      (chars)           Tag characters for purpose of recalling via
#                        $1, $2, etc. or for grouping to use with
#                        *, +, or ? expressions.

#      ? after + or *    Turn off greedy matching e.g., .*?X means
#                        shortest string of anything before first X.

#      {3}               Three of preceding character (or group).
#      {3,}              Three or more of preceding char. (or group).
#      {3,5}             Between three and five of the preceding
#                        character or character group.
 
#      \b                Zero-width word boundary.
#      \B                Zero-width NON-boundary (any position where \b
#                        does not match).

#      |                 "Or" bar -- used to list alternatives.

#      (?:chars)         Group chars but do NOT store them in a memory
#                        variable ($1, $2, etc.).

#      [chars]           One character which is a member of chars.
#      [^chars]          One character which is NOT a member of chars.

#      \                 Take next special character literally (NOT as a
#                        regexp operator) e.g., \. is a literal period.
#      \Q....\E          Take ALL characters between \Q and \E literally.

#      .                 Match any one character (except newline!!!).

#      ^ and $           Beginning (ending) of line.

###################  Code Demo of Regular Expressions  ####################

$var = "ABC123";
$var =~ s/\d//g;   #  Remove all digit characters.
print "$var\n\n";

$var = "ABC123";
$var =~ s/\D//g;   #  Remove all non-digit characters.
print "$var\n\n";

$var = "%ABC*34_";
$var =~ s/\w//g;   #  Remove all "word" characters (letters, digits, underscore)
print "$var\n\n";

$var = "%ABC*34_";
$var =~ s/\W//g;   #  Remove all non-word characters. 
print "$var\n\n";

$var = "roughhouse millhouse housefly housecat";
$var =~ s/\bhouse/X/g;   #  Change "house" to X if word boundary before "house".
print "$var\n\n";

$var = "area careen arena bare"; 
$var =~ s/\Bare\B/X/g;   #  Change "are" to X only if there is NO word
print "$var\n\n";          #  boundary immediately before or after it.

$var = "a:b:c:d:e";
$var =~ s/^(.*):([^:]*)/$2:$1/;     #  Swap tagged fields.  If you want to swap
print "$var\n\n";                     #  2 sides of first colon this is not right!

$var = "a:b:c:d:e";
$var =~ s/^(.*?):([^:]*)/$2:$1/;   #  Swap tagged fields.  If you want to swap
print "$var\n\n";                     #  2 sides of first colon this IS right!!


####  In the second swap above, ? turns off the "greedy" matching caused by
####  the "*" regexp character (it does the same after "+").

$var = "123:45:67890:7654321";
$var =~ s/\d{3,}/X/g;            #  Three or more digits become one X.
print "$var\n\n";

$var = "123:45:67890:7654321";   #  Each 3-digit group becomes one X.
$var =~ s/\d{3}/X/g;
print "$var\n\n";

$var = "123:45:67890:7654321";
$var =~ s/\d{2,4}/X/g;           #  Two to four digits (greedy!) become X.
print "$var\n\n";

$var = "1.2.3.4.5.6";
$var =~ s/\./X/;      #  Change literal period to X -- first occurrence only!!
print "$var\n\n";

$var = "1.2.3.4.5.6";
$var =~ s/\./X/g;     #  Change literal period to X -- all occurrences!!
print "$var\n\n";

$var = "*.?+][";
$var2 = ".?+";
$var =~ s/\Q$var2\E/X/;   #  Characters in $var2 taken LITERALLY!
print "$var\n\n";

$var = "a1b2c3X4Y5Z6";
$var =~ s/[a-z]/*/g;      #  Change each lowercase letter to an asterisk.
print "$var\n\n";

$var = "a1b2c3X4Y5Z6";
$var =~ s/[a-z]/*/gi;     #  Change each letter to an asterisk. 
print "$var\n\n";         #  i option means "case insensitive".

$var = "a1b2c3X4Y5Z6";
$var =~ s/[^a-z]/*/g;     #  Change each char that is NOT a lowercase letter to *.
print "$var\n\n";

$var = "a1b2c3X4Y5Z6";
$var =~ s/[^a-z]/*/gi;    #  Change each non-letter to an asterisk.
print "$var\n\n";

$var = "a1b2c3d4e5f6g7";
$var =~ s/[a-c5-7F]/X/g;
print "$var\n\n";         #  Change a through c, 5 through 7, and F to X.

$var = "abcdefg";
$var =~ s/^./X/;          #  Change first char of line to X.
print "$var\n\n";   

$var = "abcdefg";
$var =~ s/.$/X/;          #  Change last char of line to X.
print "$var\n\n";   

$var = "abcdefg";
$var =~ s/^(.)(.*)(.)$/$3$2$1/;   #  Swap first and last character.
print "$var\n\n";

print "Enter string: ";
while (($line = <STDIN>) =~ /\d/)  #  As long as $line has a digit keep going.
{
     print "$line\n";
     print "Enter string: ";
}

print "Enter string: ";
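#  As long as $line has NO digit keep going.  The defined() test ends the
#  loop at end of input, where <STDIN> returns undef (which has no digit).
while (defined($line = <STDIN>) && $line !~ /\d/)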
{
     print "$line\n";
     print "Enter string: ";
}

print "Enter string: ";
#  As long as $line is not "quit" (any case, surrounding whitespace allowed)
#  keep going.  The defined() test ends the loop at end of input.
while (defined($line = <STDIN>) && $line !~ /^\s*quit\s*$/i)
{
     print "$line\n";
     print "Enter string: ";
}

##############################  Program Output  ############################

$ regexp.pl
ABC    #  All digit chars removed.

123    #  All non-digit chars removed.

%*     #  All word chars removed.

ABC34_ #  All non-word chars removed.

roughhouse millhouse Xfly Xcat   #  "House" changed to X if it starts a word.

area cXen arena bare             #  "Are" changed to X if no word boundary on
                                 #  either side.
e:a:b:c:d            #  Bad swap!

b:a:c:d:e            #  Two sides of first colon swapped.

X:45:X:X             #  Each run of three or more digits changed to one X.

X:45:X90:XX1         #  Each group of EXACTLY three digits changed to X.

X:X:X0:XX            #  Two to four digits (greedy!!) changed to X's.

1X2.3.4.5.6          #  First period changed to X.

1X2X3X4X5X6          #  All periods changed to X.

*X][                 #  Substitute done with \Q...\E around pattern. 

*1*2*3X4Y5Z6         #  Lowercase letters changed to *'s.

*1*2*3*4*5*6         #  All letters changed to *'s.

a*b*c*******         #  Each char that is not a lowercase letter changed to *.

a*b*c*X*Y*Z*         #  Each non-letter character changed to *.

X1X2X3d4eXfXgX       #  a thru c, 5 thru 7, and F changed to X.

Xbcdefg              #  First character of line changed to X.

abcdefX              #  Last character of line changed to X.

gbcdefa              #  First and last character of line swapped.

Enter string: abc3   #  Start loop which stops if $line doesn't have a digit.
abc3

Enter string: x*7io
x*7io

Enter string: abc     #  Exits first while because line has no digit.

Enter string: XYZ(*&  #  Start loop which stops if $line DOES have a digit.
XYZ(*&

Enter string: jh_)(
jh_)(

Enter string: gh7Y    #  Exits second while because line has a digit.

Enter string: this    #  Start loop which stops when user enters "quit".
this

Enter string: is 
is 

Enter string: not
not

Enter string:   QuiT  #  Exit third while loop.

#############################  Some Important Notes  ########################

1.  The substitution expression itself has a value (the number of
    substitutions made) which can sometimes be useful.  For example:

    $str = "abc4rp67###8";
    $countdigits = $str =~ s/\d/X/g;
    print "$countdigits\n";    #  Will print 4
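
    A minimal sketch of two further uses of that value (the sample strings
    and the name $spaces are illustrative assumptions, not part of the
    program above).  When nothing matches, s/// returns a false value, so
    the result can also drive an if test:

    $str = "no digits here";
    if ($str =~ s/\d/X/g)
    {
         print "changed\n";
    }
    else
    {
         print "unchanged\n";  #  Will print unchanged
    }

    $str = "a b  c";
    $spaces = ($str =~ s/\s//g);
    print "$spaces\n";         #  Will print 3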


2.  You must remember that + and * are GREEDY matchers.  Here is a common
    mistake that comes from forgetting this:

    $str = "yes:no:maybe:perhaps";
    $str =~ s/^(.*):(.*)/$2:$1/;
    print "$str\n";            #  Will print perhaps:yes:no:maybe

    If you wanted to swap the two sides of the very first colon, the following
    would have been correct:

    $str =~ s/^(.*?):([^:]*)/$2:$1/;

    The ? after the .* says that you want the SHORTEST string of zero or more
    characters leading up to a colon, NOT the LONGEST!!  The second
    tagged field must be [^:]*  -- the longest string of zero or more
    NON-COLONS.  You don't want to go past the second colon if there is one!!
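
    As a sketch, here is the first-colon swap applied to the same string,
    followed by a direct look at what the greedy and non-greedy patterns
    capture (the names $greedy and $lazy are illustrative assumptions):

    $str = "yes:no:maybe:perhaps";
    $str =~ s/^(.*?):([^:]*)/$2:$1/;
    print "$str\n";            #  Will print no:yes:maybe:perhaps

    $str = "yes:no:maybe:perhaps";
    ($greedy) = $str =~ /^(.*):/;    #  LONGEST string before a colon.
    ($lazy)   = $str =~ /^(.*?):/;   #  SHORTEST string before a colon.
    print "$greedy\n";         #  Will print yes:no:maybe
    print "$lazy\n";           #  Will print yes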

3.  Do NOT use the old-fashioned Unix regular expressions if Perl has a better
    one.  For example, use \d for a digit (not [0-9]), \D for a non-digit
    (not [^0-9]), \s for a whitespace character, and \S for a non-whitespace
    character.
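
    For instance, a minimal sketch (the sample data and the names @fields
    and $digits are illustrative assumptions) that uses the Perl shorthands
    to squeeze runs of whitespace, collect the non-whitespace fields of a
    line, and keep only its digits:

    $str = "id:\t 42   name:  Bob";
    $str =~ s/\s+/ /g;
    print "$str\n";                #  Will print id: 42 name: Bob

    @fields = $str =~ /(\S+)/g;
    print scalar(@fields), "\n";   #  Will print 4

    $digits = $str;
    $digits =~ s/\D//g;
    print "$digits\n";             #  Will print 42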

4.  Remember that sometimes the ^ (beginning of line) and $ (end of line)
    expressions are vital if you need to describe an ENTIRE LINE.  For
    example, if you want to match a line which is empty or contains only
    whitespace, the regular expression is:

    /^\s*$/

    The ^ must be present or you are allowing for the possibility that some
    non-whitespace characters precede the \s*.  The $ must be present or you
    are allowing for the possibility that some non-whitespace characters
    follow the \s*.  Be alert!  Also, remember that [^whatever] means one
    character that is NOT a member of the set enclosed in the [^...].
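
    A minimal sketch of the difference the anchors make (the test strings
    and the names $anchored and $unanchored are illustrative assumptions).
    Without ^ and $ the pattern \s* can match zero characters anywhere, so
    it matches every string:

    foreach $line ("", "   \t", "  abc  ")
    {
         $anchored   = ($line =~ /^\s*$/) ? "yes" : "no";
         $unanchored = ($line =~ /\s*/)   ? "yes" : "no";
         print "[$line] anchored: $anchored  unanchored: $unanchored\n";
    }
    #  anchored is yes, yes, no; unanchored is yes for all three strings.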