Regular Expressions

From TRCCompSci - AQA Computer Science
Revision as of 15:24, 26 January 2019 by Admin (talk | contribs) (Example 4:)
Jump to: navigation, search

A regular expression is a notation for defining all the valid strings of a formal language.

https://www.youtube.com/watch?v=FNeaf1zm01g&index=4&list=PLCiOXwirraUAnbNTfWFxkoq5MoIair49B

Examples of Regular Expression Notation

Regular Expression Meaning
a Matches a string consisting of just the symbol a
b Matches a string consisting of just the symbol b
ab Matches a string consisting of the symbol a followed by the symbol b
a* Matches a string consisting of zero or more a’s
a+ Matches a string consisting of one or more a’s
abb? Matches the string ab or the string abb. The ? symbol indicates zero or one of the preceding element
a|b Matches a string consisting of the symbol a or the symbol b

Ways to Remember the Symbols

'?' is questioning if it is there or not, so zero or one of the preceding elements.

'+' suggests you have one already and you want to add more, so its 1 or more of the preceding elements.

'*' is often used as a wildcard character and suggests whatever and anything goes, so zero or more of the preceding elements.

Precedence Rules

When using regular expressions, the rules of arithmetic precedence are as follows:

+ and * and so on are done first

Concatenation (ie joining elements together) is done next ie Brackets

| comes last

Examples

So 'ab+' will mean the '+' will only operate on the 'b'.

The '+' will be evaluated first & then the joining.

if you wanted atleast 1 'ab' pattern you would need to use brackets to get '(ab)+' .


So 'ab|cd' will join 'ab' and 'cd' before looking at the '|'.

If you wanted an 'a' followed by a 'b' or 'c', and finishing with a 'd' you will need to use brackets to get 'a(b|c)d' .

More Examples

Examples of regular expressions using the alphabet {a, b, c}

  • abc defines the language with only the string ‘abc’
  • abc | cba defines the language with two strings’ abc’ and ‘cba’
  • (a | b) c (a | b) gives four strings: ‘aca’, ‘acb’, ‘bca’, ‘bcb’
  • a+ gives an infinite number of strings: ‘a’, ‘aa’, ‘aaa’, etc
  • ab* gives an infinite number of strings: ‘a’, ‘ab’, ‘abb’, ‘abbb’, etc
  • (ab)* gives an infinite number of strings: ‘’, ‘ab’, ‘abab’, ‘ababab’, etc
  • (a | c)+ gives all possible strings of a and c (not including the empty string)

Regular expression meta-characters

Symbol Meaning Example
Used to separate alternatives a│b (Means a or b)
? Used to denote zero or one of the preceding element a? (0 or 1 as; matches with ‘’ & ‘a’)
* Used to denote zero or more of the preceding element a* (0 or more as; matches with ‘’, ‘a’, ‘aa’, etc.)
+ Used to denote one or more of the preceding element a+ (1 or more as; matches with ‘a’, ‘aa”’etc.)
( ) Used to group characters together, to indicate the scope of another operator (ab)* (Example 0 or more abs; matches with ‘’, ‘ab’, ‘abab’, etc.
[ ] Another way of denoting alternatives (instead of vertical bar). Defines a character class [ab] (means a or b)
\ The escape character (this turns the metacharacter into an ordinary character) a\* (the a character followed by the * character. Note: \ is needed as a* would mean zero or more as.)
^ Used to indicate the negation of a character class. Also used to match the position before the first character in a string a[^bc] (a followed by a character that is not a b or c) ^abc will match with abc only if it is at the beginning of a string
$ Used to match with the position after the last character in a string abc$ (will match with abc only if it is at the end of a string)
. Matches with any single character a.a (will match with any string that has an a followed by any character followed by an a e.g. ‘aca’, ‘aba’)
- Used to specify a range of values in a character class [A-Z] (character in the range of A to Z)

Remember you are expected to know ? * + | and the use of brackets, the others should be explained to you in the question.

Escape Characters

\ is the escape character. If it precedes a symbol, it means symbol has no function:

a\* means it will match exactly strings of ‘a*’

Previous exam questions have used escape characters, they have explained what the specific escape characters mean. They used them for digits, ie \a meant any alphabetic character & \n meant any numeric digit.

C# Regex Example

You will need to add to the using section of your program, we need to import System.Text.RegularExpressions. This will return partial matches as well as full matches. For example 'ab+' and 'aba' will return a match because it contains 'ab' even though it isn't a full match. To fix this use the '^' at the start to signify it must start with 'ab+' and the '$' at the end to signify it must end with 'ab+'.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"^ab+$");
        Match match = regex.Match("abb");
        if (match.Success)
        {
            Console.WriteLine("MATCH VALUE: " + match.Value);
        }
    }
}

Regular Language

https://www.youtube.com/watch?v=qGfPe2g8VOs&list=PLCiOXwirraUAnbNTfWFxkoq5MoIair49B&index=5

A regular language is a formal language that can be accepted by a finite state machine. Regular Expressions can also be specified using a FSM, an example question:

The FSM in below defines the language that allows all strings containing at least, either two consecutive 1s or two consecutive 0s.
The strings 0110, 00 and 01011 are all accepted by the FSM and so are valid strings in the language.
The strings 1010 and 01 are not accepted by the FSM and so are not valid strings in the language.

Regexfsm.png

For 3 marks you need to write the regular expression that matches this FSM.

Quiz

To see if you remember what you have learned about Regular Expressions, take this quick test either by yourself, or with friends: https://play.kahoot.it/#/?quizId=563cc210-1979-47aa-9a22-b2ab2fe94a8c


Maths for Regular Expressions

https://www.youtube.com/watch?v=XfKOIWi420s&list=PLCiOXwirraUAnbNTfWFxkoq5MoIair49B&index=2


https://www.youtube.com/watch?v=cH9K02dihBc&index=3&list=PLCiOXwirraUAnbNTfWFxkoq5MoIair49B

Basic set understanding

Sets describe collections of things or values, such as numbers, animals or people.

In a set, each value occurs only once. For example, the value fox will occur only once in the set of animals. Just as the value 3 occurs only once in the element of all natural numbers, N.

The cardinality of a set refers to the number of elements it contains. An empty set is written ∅ and its cardinality is 0.

Sets may be finite or infinite. For example, the set of people currently alive in the world will be finite, but the set of N is infinite.

A set may be countable or uncountable. A countable set is a set, whose elements can be matched with the set of natural numbers. In other words, it is possible to count the elements one-by-one. If a set is finite, it will always be countable. The best example of an uncountable set is R. It is impossible to match each element of R with an element of N. This is because there are not enough elements in N to match each one with an element of R.

Membership ∈ and its reverse ∉

x ∈ S, x is an element of S.

Examples:

  • x ∈ ℕ, meaning that x is an element of the set of natural numbers. For example, 3 ∈ ℕ. Whereas 3.5 ∉ ℕ
  • x ∈ Q, meaning that x is an element of the set of rational numbers. Whereas pi ∉ Q
  • x ∈ WorkingDays, meaning that x is an element of the set of all WorkingDays. For example, Monday ∈ WorkingDays. Whereas Saturday ∉ WorkingDays
  • x ∈ WeekendDays, meaning that xis an element of the set of all WeekendDays. For example, Saturday ∈ WeekendDays. Whereas Monday ∉ WeekendDays

Union ∪

A ∪ B, meaning that all elements of the set A form a union with all of the elements in set B. This is a set comprehension, since this generates a new set.

The union of two sets:

Union.png

Examples:

  • Q ∪ Irrational Number Set, meaning that all numbers in the set of rational numbers form a union with the set of all irrational numbers. This set comprehension generates the set of real numbers.
  • WorkingDays ∪ WeekendDays, meaning that all of the working days (elements of WorkingDays) form a union with weekend days (elements of WeekendDays). This set comprehension generates the set of WeekDays.
  • {1, 2} ∪ {2,3,4} = {1,2,3,4} (notice that only once instance of 2 is in the resulting set)
  • {1, 2, green} ∪ {red, white, green}={1, 2, red, white, green}
  • {1, 2} ∪ {1, 2} = {1, 2}

Intersection ∩

A ∩ B, meaning that the set A and set B form an intersection. The generated set will be ∅, if the two sets share no elements.

The intersection of two sets:

Intersection.png

Examples:

Let A be the set of numbers which are divisible by 3 and B the set of numbers divisible by 4.

  • A ∩ B, meaning that A and B form an intersection, which generates a set, which contains all of the numbers divisible by 3 and 4.
  • WorkingDays ∩ WeekendDays, meaning that WorkingDays and WeekendDays form an intersection, which is ∅

Now let A be the set of all A-level students who take computer science and B the set of all A-level students who take mathematics.

  • A ∩ B, meaning that A and B form an intersection, which generates a set, which contains all students who take both computer science and mathematics.
  • {1, 2, 3} ∩ {2, 3, 4} = {2, 3}
  • {1, 2, 3} ∩ {bear,hen,squirrel} = Ø (this means an empty set)
  • {cat, dog, canary} ∩ {wolf, canary, whale, cat} = {cat, canary}

Difference \

A \ B, meaning that the set A and set B form a set difference. This will generate a set, which contains the elements of A, which are NOT also in B.

The difference of two sets:

Difference.png

Examples:

  • R \ Q, meaning that R and Q form a set difference. This will generate the set of irrational numbers.

Let A be the set of all A-level students who take computer science and B the set of all A-level students who take mathematics.

  • A \ B, meaning that A and B form a set difference. This will generate the set of all computer science students, who do not also take mathematics.
  • {1, 2, 3} \ {2, 3, 4} = {1}
  • {1, 2, 3} \ {bear,hen,squirrel} = {1, 2, 3}
  • {cat, dog, canary} \ {wolf, canary, whale, cat} = {dog}

Proper Subsets ⊂

S ⊂ T, meaning that S is a proper subset of T, such that all elements in S are also elements of T. However, there will need to be at least one element in T, which is not in S. Examples:

  • N ⊂ Z, meaning that the set of natural numbers is a proper subset of the set of integers. In other words, all natural numbers are also integers. There are some elements in Z, which are not in N - these are the negative integers.
  • Z ⊂ Q, meaning that the set of integers is a proper subset of the set of rational numbers. In other words, all integers are also rational numbers. But there are rational numbers which are not integers.

Subsets ⊆

A ⊆ B, meaning that A forms a subset of B. In other words, A could be identical to B, but does not have to be! Examples: Let A be the set of all students in 6th form and B the set of all students taking computer science. ^A ⊆ B, meaning that A forms a subset of B. In other words, all students in the 6th form could be taking computer science. In which case the two sets would be the same, but they may not be! Now let A be all of the students present in school and B the set of students currently in the assembly hall.

  • A ⊆ B, meaning that A forms a subset of B. In other words, all students currently in present in school may all be located in the assembly hall, but they may not be!

Cartesian Product

In mathematics, a Cartesian product is a mathematical operation that returns a set (or product set or simply product) from multiple sets. That is, for sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B. The Cartesian Product of an empty set is an empty set. Products can be specified using set-builder notation, e.g.

Cartesianproduct.png

Compact Set Notation

Compact set notation is a useful tool to describe the properties of each element of a set, rather than writing out all elements of a set. The table below lists all of the necessary symbols for compact set notation.

Symbol Meaning
x ∈ S x is an element of S
| such that
and
or
< smaller than
<= smaller or equal than
> greater than
>= greater or equal than
A ⊆ B A is a subset of B
A ⊂ B A is a proper subset of B
C = A ∪ B C is the union of A and B
C = A ∩ B C is the intersection of A and B
C = A \ B C is the difference of A and B

Example 1:

We might want to describe a set A, which contains all natural numbers greater than 10. This means that we are interested in a subset of ℕ, such that each of its elements is greater than the value 10. Using set notation this set can be described as:

A = {x|x ∈ ℕ ∧ x> 10}

Example 2:

This time set A is described as natural numbers, which are multiples of 10. This can be notated as:

A = {10x|x ∈ ℕ}

Example 3:

The elements of set A are all even numbers, up to including 12. Set notation allows us to write this set as {0,2,4,6,8,10,12}, which in this example is feasible. However, this would be very impractical, if A is described as all even numbers, up to including 1,000,000!

A = {2x | x ∈ N &and 2x < = 1,000,000}

The table below explains how each expression is linked to the english description of the elements of the set A.

English description... ... expressed in set notation
x must be even 2x
such that 2x |
2x must be a natural number 2x ∈ N
2x must be smaller or equal than 1,000,000 2x < = 1,000,000

Example 4:

Set A is described as all the natural multiples of 3, below 5,000.

A = {3x | x ∈ N ∧ 3x > 5,000}

The table below explains how each expression is linked to the english description of the elements of the set A.

English description... ... expressed in set notation
x must be a multiple of 3 3x
such that |
x must be a natural number x ∈ N
3x must be smaller than 5,000 3x < 5000

Well Known Sets

  • Natural numbers N= {0, 1, 2, 3, …}
  • Integers Z= {…, -2, -1, 0, 1, 2, …}
  • Positive Integers Z+= {1, 2, 3, 4, …}
  • Real Numbers R= {47.3, -12, π, …}
  • Rational Numbers Q= {1.5, 2.6, -3.8, 15, …}
  • An Ordinal Number is a number that tells the position of something in a list, such as 1st, 2nd, 3rd, 4th, 5th etc.

Quiz

To see if you remember what you have learned about the Maths For Regular Expressions, take this quick test either by yourself, or with friends: https://play.kahoot.it/#/?quizId=3b146531-33b6-41f2-8f56-d0a6423401e9