Table of Contents


DATA ANALYTICS REFERENCE DOCUMENT

Document Title: 46887 - Computational Thinking with Algorithms
Document No.: 1548328636
Author(s): Rita Raher, Gerhard van der Linde
Contributor(s):

REVISION HISTORY


Revision | Details of Modification(s) | Reason for modification | Date | By
0 | Draft release | 46887 Module summary reference page | 2019/01/24 11:17 | Gerhard van der Linde

46887 - Computational Thinking with Algorithms

Module learning outcomes

On completion of this module the learner will/should be able to

  1. Apply structured methodologies to problem solving in computing.
  2. Design algorithms to solve computational problems.
  3. Critically evaluate and assess the performance of algorithms.
  4. Translate real-world problems into computational problems

Indicative Syllabus

External Resources & Further Reading

Websites:

Videos:

Books:

Documentaries:

Papers:

Week 1 - Introduction

01 Introduction

01 Introduction

02 Review of Programming and Mathematical Concepts

02 Review of Programming and Mathematical Concepts

Mathematical operators

Operator | Description | Examples
+ | Additive operator (also string concatenation) | 2 + 1 = 3; “abc” + “_” + 123 = “abc_123”
- | Subtraction operator | 25 – 12 = 13
* | Multiplication operator | 2 * 25 = 50
/ | Division operator | 35 / 5 = 7
% | Remainder operator | 35 % 5 = 0, 36 % 5 = 1

Order of operations - BEMDAS

Exponents

Exponents indicate that a quantity is to be multiplied by itself some number of times, e.g. 2^3 = 2 × 2 × 2 = 8.

Variables

A variable is simply a storage location with an associated name, which we can use to store some information for later use.

Types of variables

Data Types

Strongly and weakly typed languages

Common operators

Operator | Description | Examples
= | Assignment operator | int number = 23; String myWord = “apple”;
++ | Increment operator; increments a value by 1 | int number = 23; number++; System.out.println(number); prints 24
-- | Decrement operator; decrements a value by 1 | int number = 23; number--; System.out.println(number); prints 22
+= | Assignment (shorthand for number = number + value) | int number = 23; number += 2; System.out.println(number); prints 25
-= | Assignment (shorthand for number = number - value) | int number = 23; number -= 2; System.out.println(number); prints 21
== | Equality, equal | 2==1 is false
!= | Equality, not equal | 2!=1 is true
&& | Logical AND | 2==1 && 1==1 is false
|| | Logical OR | 2==1 || 1==1 is true
! | Logical NOT; inverts the value of a Boolean | !success
> | Relational, greater than |
>= | Relational, greater than or equal to |
< | Relational, less than |
<= | Relational, less than or equal to |
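As a quick illustration of the assignment, comparison and logical operators above, here is a short Python sketch (illustrative only; note that Python has no ++ or -- operators, so += 1 and -= 1 are used instead):

# Demonstrating assignment, comparison and logical operators in Python.
# Note: Python has no ++ or -- operators; use += 1 and -= 1 instead.
number = 23
number += 2                 # shorthand for number = number + 2
print(number)               # prints 25
number -= 4
print(number)               # prints 21
print(2 == 1)               # False (equality, equal)
print(2 != 1)               # True  (equality, not equal)
print(2 == 1 and 1 == 1)    # False (logical AND)
print(2 == 1 or 1 == 1)     # True  (logical OR)
success = False
print(not success)          # True  (logical NOT inverts a Boolean)
print(3 >= 2, 3 < 2)        # True False (relational operators)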

Functions

Control structures

Sequential

Selection

Iteration

Data structure

array
list
cars = [“Ford”, “Volvo”, “BMW”]

Week 2 - Analysing Algorithms - Part 1

Analysing Algorithms - Part 1

03 Analysing Algorithms Part 1

Roadmap

Recap

Features of a well-designed algorithm

Efficiency

Analysing efficiency

Complexity

Performance vs. complexity

It is important to differentiate between:

  • Performance: how much time, memory or other resources are actually used when a program is run. This depends on the machine, the language implementation and the code, as well as on the algorithm itself.
  • Complexity: how the resource requirements of an algorithm scale as the size of the input grows.

Note that complexity affects performance, but not the other way around.

Comparing complexity

Orders of magnitude

  • Order of Magnitude is a mathematical term which basically tells the difference between classes of numbers
  • Think of the difference between a room with 10 people, and the same room with 100 people. These are different orders of magnitude
  • However, the difference between a room with 100 people and the same room with 101 people is barely noticeable. These are of the same order of magnitude
  • How many gumballs in this gumball machine?
  • If we guess 200, that’s probably pretty close
  • Whereas a guess of 10,000 would be way off

Complexity families

Evaluating complexity

Best, average and worst cases

Week 3 - Analysing Algorithms Part 2

Analysing Algorithms Part 2

Analysing Algorithms Part 2

Roadmap

Review of complexity

Comparing growth functions

value of <m>n</m> | constant | <m>log n</m> | <m>n</m> (linear) | <m>n log n</m> (linearithmic) | <m>n^2</m> (quadratic) | <m>n^3</m> (cubic) | <m>2^n</m> (exponential)
8 | 1 | 3 | 8 | 24 | 64 | 512 | 256
16 | 1 | 4 | 16 | 64 | 256 | 4096 | 65536
32 | 1 | 5 | 32 | 160 | 1024 | 32768 | 4294967296
64 | 1 | 6 | 64 | 384 | 4096 | 262144 | 1.84467E+19
128 | 1 | 7 | 128 | 896 | 16384 | 2097152 | 3.40282E+38
256 | 1 | 8 | 256 | 2048 | 65536 | 16777216 | 1.15792E+77
512 | 1 | 9 | 512 | 4608 | 262144 | 134217728 | 1.3408E+154

Best, worst and average cases

Worst case

Big O notation

Formally

Suppose 𝑓(𝑥) and 𝑔(𝑥) are two functions defined on some subset of the set of real numbers

<m 18>f(x)=O(g(x)) for x right infinity</m>

if and only if there exist constants 𝑁 and 𝐶 such that

<m 18>delim{|}{f(x)}{|} <= C delim{|}{g(x)}{|}</m> for all x > N

Intuitively, this means that 𝑓 does not grow more quickly than 𝑔

(Note that N is the input size beyond which the higher-order term begins to dominate)
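To make the definition concrete, here is a small illustrative check (an assumed example, not from the lecture notes): f(n) = 3n^2 + 5n is O(n^2), since |f(n)| <= 4·|n^2| for all n > 5.

# Informal numeric check that f(n) = 3n^2 + 5n is O(n^2),
# using the assumed constants C = 4 and N = 5 in the definition above.
def f(n):
    return 3 * n**2 + 5 * n
 
def g(n):
    return n**2
 
C, N = 4, 5
print(all(abs(f(n)) <= C * abs(g(n)) for n in range(N + 1, 10000)))  # prints True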

Tightest upper bound

Ω (omega) notation

Θ (theta) notation

Separating an algorithm and its implementation

Evaluating complexity

Passing an array to a method in Java and Python

// passing an array of integers into a method in Java 
void myMethod(int[] elements) {
    // do something with the data here 
}
# passing a list into a function in Python
def my_function(elements):
    # do something with the data here
    pass
 
test1 = [5, 1, 12, -5, 16]
my_function(test1)

Array with 5 elements: [5, 1, 12, -5, 16]

O(1) example

Java Code

boolean isFirstElementTwo(int[] elements) 
{
    if(elements[0] == 2) { 
        return true; 
    } 
    else { 
        return false; 
    } 
}

Python Code

# O(1) Example
def is_first_el_two(elements):
    if elements[0] == 2:
        return True
    return False
 
test1 = [2, 1, 0, 3]
test2 = [0, 2, 3, 4]
 
print(is_first_el_two(test1))  # prints True
print(is_first_el_two(test2))  # prints False

O(n) example

Java Code

boolean containsOne(int[] elements) { 
    for (int i=0; i<elements.length; i++){ 
        if(elements[i] == 1) { 
            return true; 
        } 
    } 
    return false; 
}

Python Code

# O(n) Example
def contains_one(elements):
    for i in range(0, len(elements)):
        if elements[i] == 1:
            return True
    return False
 
test1 = [0, 2, 1, 2]
test2 = [0, 2, 3, 4]
 
print(contains_one(test1))  # prints True
print(contains_one(test2))  # prints False

O(n^2) example

Java Code

boolean containsDuplicates(int[] elements) 
{ 
  for (int i=0; i<elements.length; i++){ 
    for (int j=0; j<elements.length; j++){ 
      if(i == j){ // avoid self comparison 
        continue; 
      } 
      if(elements[i] == elements[j]) { 
        return true; // duplicate found 
      } 
    } 
  } 
  return false; 
}

Python Code

# O(n^2) Example
def contains_duplicates(elements):
  for i in range (0, len(elements)):
    for j in range(0, len(elements)):
      if i == j: ##avoid self comparison
        continue
      if elements[i] == elements[j]:
        return True # duplicate found
  return False
 
test1 = [0, 2, 1, 2]
test2 = [1, 2, 3, 4]
 
print(contains_duplicates(test1))  # prints True
print(contains_duplicates(test2))  # prints False

Week 4 - Recursive Algorithms Part 1

Recursive Algorithms Part 1

Lecture Notes

05_recursive_algorithms_part_1.pdf

Socratica Third Part Video

The video and the code are a very condensed summary of the following section and are not part of the syllabus; you can skip ahead to Iteration and recursion.

Socratica - Recursion, the Fibonacci Sequence and Memoization

Code Summary from the video

memoization.py
from functools import lru_cache
 
#@lru_cache(maxsize=1000)
def fibonacci(n):
    if n == 1:
        return 1
    elif n == 2:
        return 1
    elif n > 2:
        return fibonacci(n-1) + fibonacci(n-2)
 
for n in range(1, 1001):
    print(n, ":", fibonacci(n))

Run the code above with the @lru_cache line commented out and then uncommented, and observe the difference. The video linked above explains the details.

Roadmap

Iteration and recursion

Recursion

Simple Recursion Program

Java code

void main(){
	count(0);
}
 
void count(int index){
	print(index);
	if(index <2){
		count(index+1);
	}
}

Python code

def count(index):
  print(index)
  if index < 2:
    count(index + 1)
 
count(0)    # outputs 0 1 2 

Recursion trace for the call count(0)

Stacks

Stacks and recursion

Why use recursion?

Recursion vs iteration

Types of recursion

Tail Recursions

Infinite recursion

Java code

void infinite(int x){
	infinite(x-1);
}

Python code

def infinite(x):
  infinite(x-1)
 
infinite(1)
# RecursionError:
# maximum recursion depth exceeded

Circular Recursion

Java code

void circular(int x){
	if(x==1){
	  circular(x+1);
	}
	circular(x-1);
}

Python code

def circular(x):
  if x == 1:
    circular(x + 1)
  circular(x - 1)
 
circular(1)    # RecursionError:
# maximum recursion depth exceeded
# in comparison

Rules for recursive algorithms

  1. Base case: a recursive algorithm must always have a base case which can be solved without recursion. Methods without a base case will result in infinite recursion when run.
  2. Making progress: for cases that are to be solved recursively, the next recursive call must be a case that makes progress towards the base case. Methods that do not make progress towards the base case will result in circular recursion when run.
  3. Design rule: Assume that all the recursive calls work.
  4. Compound interest rule: Never duplicate work by solving the same instance of a problem in separate calls.

Designing Recursive Algorithms

Recap

Week 5 - Recursive Algorithms Part 2

Roadmap

Review of recursion

Rules for recursive algorithms

  1. Base case: a recursive algorithm must always have a base case which can be solved without recursion. Methods without a base case will result in infinite recursion when run.
  2. Making progress: for cases that are to be solved recursively, the next recursive call must be a case that makes progress towards the base case. Methods that do not make progress towards the base case will result in circular recursion when run.
  3. Design rule: assume that all the recursive calls work.
  4. Compound interest rule: Never duplicate work by solving the same instance of a problem in separate recursive calls.

Factorials

Computing a factorial

Iterative implementation

def factorial(n):
	answer = 1
	while n > 1:
		answer *= n
		n -=1
	return answer
 
print(factorial(5))
#prints 120

Recursive implementation

def factorial_rec(n):
	if n <1:
		return 1
	else: 
		return n*factorial_rec(n-1)
 
print(factorial_rec(5))
#print 120

Greatest common Divisor

Computing the greatest common divisor

Iterative implementation

def euclid(a, b):
	while b != 0:
		temp = b
		b = a % b
		a = temp
	return a
 
print(euclid(35, 49))
# prints 7

Recursive implementation

def euclid(a, b):
	if b == 0:
		return a
	else:
		return euclid(b, a % b)
 
print(euclid(35, 49))
# prints 7

Fibonacci series

Computing the nth Fibonacci number

Iterative implementation

def fib(n):
	i, n1, n2 = 1, 0, 1
	while i < n:
		temp = n1
		n1 = n2
		n2 = n1 + temp
		i = i + 1
	return n1
 
print(fib(5))
# prints 3

Recursive implementation

def fib(n):
	if n == 1:
		return 0
	elif n == 2:
		return 1
	return fib(n-1) + fib(n-2)
 
print(fib(5))
# prints 3

Week 6 - Cryptography

Overview

07_algorithms_cryptography_ssl_2019_part1.pdf

Introduction to Cryptography

This is somewhat surprising…..

problems for which no good algorithms are known are crucial here.

Cryptography (Problem Statement)

Cryptography (Some definitions)

A general encryption procedure (left) and decryption procedure (right) is as follows:

<m 20>C = Encr(P)</m> and <m 20>P = Decr(C)</m>

Symmetric Cryptography

Symmetric Cryptography - Simple Example

Substitution Cipher

<insert image>

E.g. hello -> khoor

So, from the example above, this kind of cipher is easily broken… (a minimal sketch of the cipher follows below)
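A minimal Python sketch of this shift-by-3 substitution (Caesar) cipher, which maps “hello” to “khoor”; the function names here are illustrative only:

# Shift-by-3 substitution (Caesar) cipher: each lowercase letter is replaced
# by the letter 3 positions later in the alphabet, wrapping around at 'z'.
def encrypt(plaintext, shift=3):
    return ''.join(
        chr((ord(c) - ord('a') + shift) % 26 + ord('a')) if c.islower() else c
        for c in plaintext)
 
def decrypt(ciphertext, shift=3):
    return encrypt(ciphertext, -shift)
 
print(encrypt("hello"))   # prints khoor
print(decrypt("khoor"))   # prints hello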

Symmetric Cryptography - Examples continued...

Polyalphabetic substitution:

Polyalphabetic substitution continued

Decryption Algorithm

Cryptography: Algorithms and keys

Cryptographic keys

…and each letter of the phrase could take on 1 of 26 possible numeric values.

Ours are (t = 20, h = 8, etc.)

As we had 10 characters in our phrase, and each character could be any 1 of 26 possible values, our potential key space is <m>26^10</m>, i.e. there are 141,167,095,653,376 possible keys or “phrases”.

Some limitations of Symmetric Cryptography

Cryptography: Key Distribution

Asymmetric Cryptography: Public Key Cryptography

Public Key Cryptography

A little bit of number theory

Prime number: A number is prime if it is greater than 1 and if its only factors are itself and 1.

RSA Algorithm

(The foundation is number theory, based on Euler's generalisation of Fermat's Little Theorem.)

Format

A block of plaintext (the message), M, gets transformed into a ciphertext block C

Requirements:

RSA: A (simple) example

e.g. PIN = 2345

“When sending us private data such as your PIN, encrypt it using RSA. This is the relevant public key information: e=7, n=33.”

Encrypting my "Plaintext" PIN: 2 3 4 5

Decrypting 29,9,16,14
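A minimal Python sketch of this toy example. The public key (e = 7, n = 33) is taken from the text above; the private exponent d = 3 is derived here on the assumption that n = 3 × 11, so φ(n) = 20 and 7 × 3 ≡ 1 (mod 20). Tiny keys like this are for illustration only.

# Toy RSA sketch using the public key from the example: e = 7, n = 33.
# Assumes n = 3 * 11, so phi(n) = 20 and the private exponent d = 3 (7 * 3 = 21 ≡ 1 mod 20).
e, n, d = 7, 33, 3
 
def encrypt_digit(m):
    return pow(m, e, n)   # C = M^e mod n
 
def decrypt_digit(c):
    return pow(c, d, n)   # M = C^d mod n
 
pin = [2, 3, 4, 5]
cipher = [encrypt_digit(m) for m in pin]
print(cipher)                               # prints [29, 9, 16, 14]
print([decrypt_digit(c) for c in cipher])   # prints [2, 3, 4, 5]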

Next steps

To do W6


DATA ANALYTICS REFERENCE DOCUMENT

Document Title: Document Title
Document No.: 1552735583
Author(s): Gerhard van der Linde, Rita Raher
Contributor(s):

REVISION HISTORY


Revision | Details of Modification(s) | Reason for modification | Date | By
0 | Draft release | Document description here | 2019/03/16 11:26 | Gerhard van der Linde, Rita Raher

Week 8 - Sorting Algorithms Part 1

Overview

  • Introduction to sorting
  • Conditions for sorting
  • Comparator functions and comparison-based sorts
  • Sort keys and satellite data
  • Desirable properties for sorting algorithms
    • Stability
    • Efficiency
    • In-place sorting
  • Overview of some well-known sorting algorithms
  • Criteria for choosing a sorting algorithm

Sorting

  • Sorting – arrange a collection of items according to some pre-defined ordering rules
  • There are many interesting applications of sorting, and many different sorting algorithms, each with their own strengths and weaknesses.
  • It has been claimed that as many as 25% of all CPU cycles are spent sorting, which provides a great incentive for further study and optimization
  • The search for efficient sorting algorithms dominated the early days of computing.
  • Numerous computations and tasks are simplified by properly sorting information in advance, e.g. searching for a particular item in a list, finding whether any duplicate items exist, finding the frequency of each distinct item, finding order statistics of a collection of data such as the maximum, minimum, median and quartiles.

Timeline of sorting algorithms

  • 1945 – Merge Sort developed by John von Neumann
  • 1954 – Radix Sort developed by Harold H. Seward
  • 1954 – Counting Sort developed by Harold H. Seward
  • 1959 – Shell Sort developed by Donald L. Shell
  • 1962 – Quicksort developed by C. A. R. Hoare
  • 1964 – Heapsort developed by J. W. J. Williams
  • 1981 – Smoothsort published by Edsger Dijkstra
  • 1997 – Introsort developed by David Musser
  • 2002 – Timsort implemented by Tim Peters

Sorting

  • Sorting is often an important step as part of other computer algorithms, e.g. in computer graphics (CG) objects are often layered on top of each other; a CG program may have to sort objects according to an “above” relation so that objects may be drawn from bottom to top
  • Sorting is an important problem in its own right, not just as a preprocessing step for searching or some other task
  • Real-world examples:
    • Entries in a phone book, sorted by area, then name
    • Transactions in a bank account statement, sorted by transaction number or date
    • Results from a web search engine, sorted by relevance to a query string

Conditions for sorting

  • A collection of items is deemed to be “sorted” if each item in the collection is less than or equal to its successor
  • To sort a collection A, the elements of A must be reorganised such that if A[i] < A[j], then i < j
  • If there are duplicate elements, these elements must be contiguous in the resulting ordered collection – i.e. if A[i] = A[j] in a sorted collection, then there can be no k such that i < k < j and A[i] ≠ A[k].
  • The sorted collection A must be a permutation of the elements that originally formed A (i.e. the contents of the collection must be the same before and after sorting)

Comparing items in a collection

  • What is the definition of “less than”? Depends on the items in the collection and the application in question
  • When the items are numbers, the definition of “less than” is obvious (numerical ordering)
  • If the items are characters or strings, we could use lexicographical ordering (i.e. apple < arrow < banana)
  • Some other custom ordering scheme – e.g. Dutch National Flag Problem (Dijkstra), red < white < blue

Comparator functions

  • Sorting collections of custom objects may require a custom ordering scheme
  • In general: we could have some function compare(a,b) which returns:
    • -1 if a < b
    • 0 if a = b
    • 1 if a > b
  • Sorting algorithms are independent of the definition of “less than” which is to be used
  • Therefore we need not concern ourselves with the specific details of the comparator function used when designing sorting algorithms (a minimal comparator sketch follows below)
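Following the three-way convention above, a minimal illustrative comparator in Python; the function names and the colour ranking are assumptions for illustration, echoing the Dutch National Flag example:

# Minimal three-way comparator: -1 if a < b, 0 if a == b, 1 if a > b.
def compare(a, b):
    if a < b:
        return -1
    elif a > b:
        return 1
    return 0
 
# A custom ordering (Dutch National Flag colours: red < white < blue),
# expressed by mapping each colour to a rank before comparing.
rank = {"red": 0, "white": 1, "blue": 2}
 
def compare_colours(a, b):
    return compare(rank[a], rank[b])
 
print(compare(2, 5))                   # prints -1
print(compare_colours("blue", "red"))  # prints 1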

Inversions

  • The running time of some sorting algorithms (e.g. Insertion Sort) is strongly related to the number of inversions in the input instance.
  • The number of inversions in a collection is one measure of how far it is from being sorted.
  • An inversion in a list A is an ordered pair of positions (i, j) such that:
    • i < j but A[i] > A[j].
    • i.e. the elements at positions i and j are out of order
  • E.g. the list [3,2,5] has only one inversion corresponding to the pair (3,2), the list [5,2,3] has two inversions, namely (5,2) and (5,3), the list [3,2,5,1] has four inversions (3,2), (3,1), (2,1), and (5,1), etc. (the short counting sketch below follows directly from this definition)
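A brute-force (quadratic-time) inversion counter, included purely to illustrate the definition above:

# Count inversions by brute force, directly from the definition:
# an inversion is a pair of positions (i, j) with i < j but A[i] > A[j].
def count_inversions(a):
    return sum(1 for i in range(len(a))
                 for j in range(i + 1, len(a))
                 if a[i] > a[j])
 
print(count_inversions([3, 2, 5]))     # prints 1
print(count_inversions([5, 2, 3]))     # prints 2
print(count_inversions([3, 2, 5, 1]))  # prints 4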

Comparison sorts

  • A comparison sort is a type of sorting algorithm which uses comparison operations only to determine which of two elements should appear first in a sorted list.
  • A sorting algorithm is called comparison-based if the only way to gain information about the total order is by comparing a pair of elements at a time via the order ≤.
  • Many well-known sorting algorithms (e.g. Bubble Sort, Insertion Sort, Selection Sort, Merge Sort, Quicksort, Heapsort) fall into this category.
  • Comparison-based sorts are the most widely applicable to diverse types of input data, therefore we will focus mainly on this class of sorting algorithms
  • A fundamental result in algorithm analysis is that no algorithm that sorts by comparing elements can do better than <m>n</m> log <m>n</m> performance in the average or worst cases.
  • Under some special conditions relating to the values to be sorted, it is possible to design other kinds of non-comparison sorting algorithms that have better worst-case times (e.g. Bucket Sort, Counting Sort, Radix Sort)

Sort keys and satellite data

  • In addition to the sort key (the information which we use to make comparisons when sorting), the elements which we sort also normally have some satellite data
  • Satellite data is all the information which is associated with the sort key, and should travel with it when the element is moved to a new position in the collection
  • E.g. when organising books on a bookshelf by author, the author’s name is the sort key, and the book itself is the satellite data
  • E.g. in a search engine, the sort key would be the relevance (score) of the web page to the query, and the satellite data would be the URL of the web page along with whatever other data is stored by the search engine
  • For simplicity we will sort arrays of integers (sort keys only) in the examples, but note that the same principles apply when sorting any other type of data

Desirable properties for sorting algorithms

  • Stability – preserve order of already sorted input
  • Good run time efficiency (in the best, average or worst case)
  • In-place sorting – if memory is a concern
  • Suitability – the properties of the sorting algorithm are well-matched to the class of input instances which are expected i.e. consider specific strengths and weaknesses when choosing a sorting algorithm

Stability

  • If a comparator function determines that two elements A[i] and A[j] in the original unordered collection are equal, it may be important to maintain their relative ordering in the sorted set
  • i.e. if i < j, then the final location for A[i] must be to the left of the final location for A[j]
  • Sorting algorithms that guarantee this property are stable
  • Unstable sorting algorithms do not preserve this property
  • Using an unstable sorting algorithm means that if you sort an already sorted array, the ordering of elements which are considered equal may be altered! (the short demonstration below shows a stable sort preserving this ordering)
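A small demonstration of stability using Python's built-in sorted(), which is stable because it uses Timsort; the flight records here are invented for illustration:

# Python's built-in sorted() is stable (it uses Timsort), so records with the
# same sort key (destination) keep their original relative order.
flights = [("ORD", "08:00"), ("JFK", "09:30"), ("ORD", "07:15"), ("JFK", "06:45")]
by_destination = sorted(flights, key=lambda flight: flight[0])
print(by_destination)
# prints [('JFK', '09:30'), ('JFK', '06:45'), ('ORD', '08:00'), ('ORD', '07:15')]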

Stable sort of flight information

  • All flights which have the same destination city are also sorted by their scheduled departure time; thus, the sort algorithm exhibited stability on this collection.
  • An unstable algorithm pays no attention to the relationships between element locations in the original collection (it might maintain relative ordering, but it also might not).


Analysing sorting algorithms

  • When analysing a sorting algorithm, we must explain its best-case, worst-case, and average-case time complexity.
  • The average case is typically hardest to accurately quantify and relies on advanced mathematical techniques and estimation. It also assumes a reasonable understanding of the likelihood that the input may be partially sorted.
  • Even when an algorithm has been shown to have a desirable best-case, average-case or worst-case time complexity, its implementation may simply be impractical (e.g. Insertion Sort with large input instances).
  • No one algorithm is the best for all possible situations, and so it is important to understand the strengths and weaknesses of several algorithms.

Recap: orders of growth

Factors which influence running time

  • As well as the complexity of the particular sorting algorithm which is used, there are many other factors to consider which may have an effect on running time, e.g.
  • How many items need to be sorted
  • Are the items only related by the order relation, or do they have other restrictions (for example, are they all integers in the range 1 to 1000)
  • To what extent are the items pre-sorted
  • Can the items be placed into an internal (fast) computer memory or must they be sorted in external (slow) memory, such as on disk (so-called external sorting).

In-place sorting

  • Sorting algorithms have different memory requirements, which depend on how the specific algorithm works.
  • A sorting algorithm is called in-place if it uses only a fixed additional amount of working space, independent of the input size.
  • Other sorting algorithms may require additional working memory, the amount of which is often related to the size of the input n
  • In-place sorting is a desirable property if the availability of memory is a concern

Overview of sorting algorithms

Algorithm | Best case | Worst case | Average case | Space Complexity | Stable?
Bubble Sort | <m>n</m> | <m>n^2</m> | <m>n^2</m> | 1 | Yes
Selection Sort | <m>n^2</m> | <m>n^2</m> | <m>n^2</m> | 1 | No
Insertion Sort | <m>n</m> | <m>n^2</m> | <m>n^2</m> | 1 | Yes
Merge Sort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | <m>O(n)</m> | Yes
Quicksort | <m>n log n</m> | <m>n^2</m> | <m>n log n</m> | <m>n</m> (worst case) | No*
Heapsort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | 1 | No
Counting Sort | <m>n + k</m> | <m>n + k</m> | <m>n + k</m> | <m>n + k</m> | Yes
Bucket Sort | <m>n + k</m> | <m>n^2</m> | <m>n + k</m> | <m>n * k</m> | Yes
Timsort | <m>n</m> | <m>n log n</m> | <m>n log n</m> | <m>n</m> | Yes
Introsort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | <m>log n</m> | No

*the standard Quicksort algorithm is unstable, although stable variations do exist

Criteria for choosing a sorting algorithm

Criteria | Sorting algorithm
Small number of items to be sorted | Insertion Sort
Items are mostly sorted already | Insertion Sort
Concerned about worst-case scenarios | Heap Sort
Interested in a good average-case behaviour | Quicksort
Items are drawn from a uniform dense universe | Bucket Sort
Desire to write as little code as possible | Insertion Sort
Stable sorting required | Merge Sort

Week 9: Sorting Algorithms Part 2

Overview

  • Review of sorting & desirable properties for sorting algorithms
  • Introduction to simple sorting algorithms
    • Bubble Sort
    • Selection sort
    • Insertion sort

Review of sorting

  • Sorting - arrange a collection of items according to a pre-defined ordering rule
  • Desirable properties for sorting algorithms
    • Stability - preserve order of already sorted input
    • Good run time efficiency (in the best, average or worst case)
    • In-place sorting - if memory is a concern
    • Suitability - the properties of the sorting algorithm are well-matched to the class of input instances which are expected, i.e. consider specific strengths and weaknesses when choosing a sorting algorithm.

Overview of sorting algorithms

Algorithm | Best case | Worst case | Average case | Space Complexity | Stable?
Bubble Sort | <m>n</m> | <m>n^2</m> | <m>n^2</m> | 1 | Yes
Selection Sort | <m>n^2</m> | <m>n^2</m> | <m>n^2</m> | 1 | No
Insertion Sort | <m>n</m> | <m>n^2</m> | <m>n^2</m> | 1 | Yes
Merge Sort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | <m>O(n)</m> | Yes
Quicksort | <m>n log n</m> | <m>n^2</m> | <m>n log n</m> | <m>n</m> (worst case) | No*
Heapsort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | 1 | No
Counting Sort | <m>n + k</m> | <m>n + k</m> | <m>n + k</m> | <m>n + k</m> | Yes
Bucket Sort | <m>n + k</m> | <m>n^2</m> | <m>n + k</m> | <m>n * k</m> | Yes
Timsort | <m>n</m> | <m>n log n</m> | <m>n log n</m> | <m>n</m> | Yes
Introsort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | <m>log n</m> | No

*the standard Quicksort algorithm is unstable, although stable variations do exist

Comparison sorts

  • A comparison sort is a type of sorting algorithm which uses comparison operations only to determine which of two elements should appear first in a sorted list.
  • A sorting algorithm is called comparison-based if the only way to gain information about the total order is by comparing a pair of elements at a time via the order ≤
  • The simple sorting algorithms which we will discuss in this lecture (Bubble sort, insertion sort, and selection sort) all fall into this category.
  • A fundamental result in algorithm analysis is that no algorithm that sorts by comparing elements can do better than <m>n</m> log <m>n</m> performance in the average or worst cases.
  • Non-comparison sorting algorithms (e.g. Bucket Sort, Counting Sort, Radix Sort) can have better worst-case times.

Bubble Sort

  • Named for the way larger values in a list “Bubble up” to the end as sorting takes place
  • Bubble sort was first analysed as early as 1956 (time complexity is <m>n</m> in best case, and <m>n^2</m> in worst and average cases)
  • Comparison-based
  • In-place sorting algorithm (i.e. uses a constant amount of additional working space in addition to the memory required for the input)
  • Simple to understand and implement, but it is slow and impractical for most problems even when compared to Insertion sort.
  • Can be practical in some cases on data which is nearly sorted

Bubble Sort procedure

  • Compare each element(except the last one) with its neighbour to the right
    • if they are out of order, swap them
    • this puts the largest element at the very end
    • the last element is now in the correct and final place
  • Compare each element(except the last two) with its neighbour to the right
    • If they are out of order, swap them
    • This puts the second largest element next to last
    • The last two elements are now in their correct and final places.
  • Compare each element (except the last three) with its neighbour to the right
  • Continue as above until there are no unsorted elements on the left

Bubble Sort example

Bubble Sort in Code

public static void bubblesort(int[]a){
  int outer, inner;
  for(outer = a.length - 1; outer > 0;outer--){ //counting down
   for(inner=0;inner < outer;inner++){ //bubbling up
    if(a[inner] > a[inner+1]) {  //if out of order....
    int temp = a[inner]; //...then swap
    a[inner] =a[inner+1];
    a[inner +1] = temp;
    }
   }
  }
 }
bubblesort.py
# Bubble Sort in python
def printArray(arr):
  return (' '.join(str(i) for i in arr))
 
 
def bubblesort(arr):
  for i in range(len(arr)):
    for j in range(len(arr) - i - 1):
      if arr[j] > arr[j + 1]:
        temp = arr[j]
        arr[j] = arr[j + 1]
        arr[j + 1] = temp
    # Print array after every pass
    print ("After pass " + str(i) + "  :", printArray(arr))
 
if __name__ == '__main__':
    arr = [10, 7, 3, 1, 9, 7, 4, 3]
    print ("Initial Array :", printArray(arr))
    bubblesort(arr)

Bubble Sort Example

Analysing Bubble Sort (worst case)

for(outer =a.length-1; outer >0; outer--){
 for(inner=0; inner < outer; inner++){
  if(a[inner]>a[inner+1]){
  //swap code omitted
  }
 }
}
  • In the worst case, the outer loop executes n-1 times (say n times)
  • On average, inner loop executes about n/2 times for each outer loop
  • In the inner loop, comparison and swap operations take constant time k
  • Result is: approximately <m>n * n/2 * k</m> operations, i.e. O(<m>n^2</m>)

Selection Sort

  • Comparison-based
  • In-place
  • Unstable
  • Simple to implement
  • Time complexity is <m>n^2</m> in best, worst and average cases.
  • Generally gives better performance than Bubble Sort, but still impractical for real world tasks with a significant input size
  • In every iteration of selection sort, the minimum element (when sorting in ascending order) from the unsorted subarray on the right is picked and moved to the sorted subarray on the left.

Selection Sort procedure

  • Search elements 0 through n-1 and select the smallest
    • swap it with the element in location 0
  • Search elements 1 through n-1 and select the smallest
    • swap it with the element in location 1
  • Search elements 2 through n-1 and select the smallest
    • swap it with the element in location 2
  • Search elements 3 through n-1 and select the smallest
    • swap it with the element in location 3
  • Continue in this fashion until there's nothing left to search

Selection Sort example

The element at index 4 is the smallest, so swap with index 0
The element at index 2 is the smallest, so swap with index 1
The element at index 3 is the smallest, so swap with index 2
The element at index 3 is the smallest, so swap with index 3

Selection sort might swap an array element with itself; this is harmless, and not worth checking for

Selection Sort in Code

public static void selectionsort(int[]a){
 int outer=0, inner=0, min=0;
 for(outer = 0; outer <a.length-1;outer++){  //outer counts up
 min = outer;
 for(inner = outer +1; inner <a.length; inner++){
  if(a[inner]<a[min]){ //find index of smallest value
   min = inner;
   }
  }
  //swap a [min] with a [outer]
  int temp = a[outer];
  a[outer] = a[min];
  a[min] = temp;
 }
}
selection_sort.py
# selection sort in python
def printArray(arr):
  return (' '.join(str(i) for i in arr))
 
 
def selectionsort(arr):
  N = len(arr)
  for i in range(0, N):
    small = arr[i]
    pos = i
    for j in range(i + 1, N):
      if arr[j] < small:
        small = arr[j]
        pos = j
    temp = arr[pos]
    arr[pos] = arr[i]
    arr[i] = temp
    print ("After pass " + str(i) + "  :", printArray(arr))
 
if __name__ == '__main__':
    arr = [10, 7, 3, 1, 9, 7, 4, 3]
    print ("Initial Array :", printArray(arr))
    selectionsort(arr)

Analysing Selection Sort

  • The outer loop runs <m>n</m> - 1 times
  • The inner loop executes about <m>n</m>/2 times on average(from <m>n</m> to 2 times)
  • Result is: approximately <m>n * n/2 * k</m> operations, i.e. O(<m>n^2</m>) in the best, worst and average cases

Insertion Sort

  • Similar to the method usually used by card players to sort cards in their hand.
  • Insertion sort is easy to implement, stable, in-place, and works well on small lists and lists that are close to sorted.
  • On data sets which are already substantially sorted it runs in <m>n + d</m> time, where d is the number of inversions.
  • However, it is very inefficient for large random lists.
  • Insertion Sort is iterative and works by splitting a list of size n into a head (“sorted”) and tail (“unsorted”) sublist.

Insertion Sort procedure

  • Start from the left of the array, and set the “key” as the element at index 1. Move any elements to the left which are > the “key” right by one position, and insert the “key”.
  • Set the “key” as the element at index 2. Move any elements to the left which are > the key right by one position, and insert the key.
  • Set the “key” as the element at index 3. Move any elements to the left which are > the key right by one position, and insert the key.
  • Continue until the “key” is the element at index <m>n</m>-1. Move any elements to the left which are > the key right by one position, and insert the key.
  • The array is now sorted.

Insertion Sort example

a[1]=5 is the key; 7>5 so move 7 right by one position, and insert 5 at index 0
a[2]=2 is the key; 7>2 so move both 7 and 5 right by one position, and insert 2 at index 0
a[3]=3 is the key; 7>3 and 5>3 so move both 7 and 5 right by one position, and insert 3 at index 1
a[4]=1 is the key; 7>1, 5>1, 3>1 and 2>1 so move 7, 5, 3 and 2 right by one position, and insert 1 at index 0

(done)

Insertion Sort in code

public static void insertionsort(int a[]){
 for(int i=1; i<a.length; i++){
  int key = a[i]; //value to be inserted
  int j = i-1;
  //move all elements > key right one position
  while(j>=0 && a[j]>key){
  a[j+1]= a[j];
  j=j-1;
  }
  a[j+1]=key; //insert key in its new position
 }
 
}
insertion_sort.py
# insertion sort
def printArray(arr):
  return(' '.join(str(i) for i in arr))
 
 
def insertionsort(arr):
  N = len(arr)
  for i in range(1, N):
    j = i - 1
    temp = arr[i]
    while j >= 0 and temp < arr[j]:
      arr[j + 1] = arr[j]
      j -= 1
    arr[j + 1] = temp
    print ("After pass " + str(i) + "  :", printArray(arr))
 
if __name__ == '__main__':
    arr = [10, 7, 3, 1, 9, 7, 4, 3]
    print ("Initial Array :", printArray(arr))
    insertionsort(arr)

Analysing Insertion Sort

  • The total number of data comparisons made by insertion sort is the number of inversions d plus at most <m>n</m> -1
  • A sorted list has no inversions - therefore insertion sort runs in linear Ω(n) time in the best case(when the input is already sorted)
  • On average, a list of size <m>n</m> has <m>n(n-1)/4</m> inversions, and the number of comparisons is O(<m>n^2</m>)
  • In the worst case, a list of size <m>n</m> has <m>n(n-1)/2</m> inversions (reverse-sorted input), and the number of comparisons is O(<m>n^2</m>)

Comparison of Simple sorting algorithms

  • The main advantage that Insertion sort has over Selection Sort is that the inner loop only iterates as long as is necessary to find the insertion point.
  • In the worst case, it will iterate over the entire sorted part. In that case, the number of iterations is the same as for Selection Sort and Bubble Sort.
  • At the other extreme, however, if the array is already sorted, the inner loop won't need to iterate at all. In this case, the running time is Ω(n), which is the same as the running time of Bubble sort on an array which is already sorted.
  • Bubble Sort, Selection sort and insertion sort are all in-place sorting algorithms.
  • Bubble sort and insertion sort are stable, whereas selection sort is not.

Criteria for choosing a sorting algorithm

Criteria | Sorting algorithm
Small number of items to be sorted | Insertion Sort
Items are mostly sorted already | Insertion Sort
Concerned about worst-case scenarios | Heap Sort
Interested in a good average-case behaviour | Quicksort
Items are drawn from a uniform dense universe | Bucket Sort
Desire to write as little code as possible | Insertion Sort
Stable sorting required | Merge Sort

Recap

  • Bubble sort, selection sort and insertion sort are all O(<m>n^2</m>) in the worst case
  • It is possible to do much better than this even with comparison-based sorts, as we will see in the next lecture
  • Key points from this lecture on simple O(<m>n^2</m>) sorting algorithms:
    • Bubble sort is extremely slow, and is of little practical use
    • Selection sort is generally better than Bubble sort
    • Selection sort and insertion sort are “good enough” for small input instances
    • Insertion sort is usually the fastest of the three. In fact, for small <m>n</m> (say 5 or 10 elements), insertion sort is usually faster than more complex algorithms.

Week 10: Sorting Algorithms Part 3

Overview

  • Efficient comparison sort
    • Merge sort
    • Quick sort
  • Non-comparison sorts
    • Counting sort
    • Bucket sort
  • Hybrid Sorting algorithms
    • Introsort
    • Timsort

Overview of sorting algorithms

Algorithm | Best case | Worst case | Average case | Space Complexity | Stable?
Bubble Sort | <m>n</m> | <m>n^2</m> | <m>n^2</m> | 1 | Yes
Selection Sort | <m>n^2</m> | <m>n^2</m> | <m>n^2</m> | 1 | No
Insertion Sort | <m>n</m> | <m>n^2</m> | <m>n^2</m> | 1 | Yes
Merge Sort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | <m>O(n)</m> | Yes
Quicksort | <m>n log n</m> | <m>n^2</m> | <m>n log n</m> | <m>n</m> (worst case) | No*
Heapsort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | 1 | No
Counting Sort | <m>n + k</m> | <m>n + k</m> | <m>n + k</m> | <m>n + k</m> | Yes
Bucket Sort | <m>n + k</m> | <m>n^2</m> | <m>n + k</m> | <m>n * k</m> | Yes
Timsort | <m>n</m> | <m>n log n</m> | <m>n log n</m> | <m>n</m> | Yes
Introsort | <m>n log n</m> | <m>n log n</m> | <m>n log n</m> | <m>log n</m> | No

*the standard Quicksort algorithm is unstable, although stable variations do exist

Merge sort

  • Proposed by John von Neumann in 1945
  • This algorithm exploits a recursive divide-and-conquer approach resulting in a worst-case running time of O(<m>n log n</m>), the best asymptotic behaviour which we have seen so far.
  • Its best, worst, and average cases are very similar, making it a very good choice if predictable runtime is important - Merge Sort gives good all-round performance.
  • Stable sort
  • Versions of Merge Sort are particularly good for sorting data with slow access times, such as data that cannot be held in internal memory (RAM) or are stored in linked lists.

Merge Sort Example

  • Mergesort is based on the following basic idea:
    • if the size of the list is 0 or 1, return.
    • Otherwise, separate the list into two lists of equal or nearly equal size and recursively sort the first and second halves separately.
    • finally, merge the two sorted halves into one sorted list.
  • Clearly, almost all the work is in the merge step, which should be as efficient as possible.
  • Any merge must take at least time that is linear in the total size of the two lists in the worst case, since every element must be looked at in order to determine the correct ordering.

Merge Sort in code

merge_sort.py
# Merge sort
def mergesort(arr, i, j):
    mid = 0
    if i < j:
        mid = int((i + j) / 2)
        mergesort(arr, i, mid)
        mergesort(arr, mid + 1, j)
        merge(arr, i, mid, j)
 
 
def merge(arr, i, mid, j):
    print ("Left: " + str(arr[i:mid + 1]))
    print ("Right: " + str(arr[mid + 1:j + 1]))
    N = len(arr)
    temp = [0] * N
    l = i
    r = j
    m = mid + 1
    k = l
    while l <= mid and m <= r:
        if arr[l] <= arr[m]:
            temp[k] = arr[l]
            l += 1
        else:
            temp[k] = arr[m]
            m += 1
        k += 1
 
    while l <= mid:
        temp[k] = arr[l]
        k += 1
        l += 1
    while m <= r:
        temp[k] = arr[m]
        k += 1
        m += 1
    for i1 in range(i, j + 1):
        arr[i1] = temp[i1]
    print ("After Merge: " + str(arr[i:j + 1]))
 
if __name__ == '__main__':
    arr = [9, 4, 8, 3, 1, 2, 5]
    print ("Initial Array: " + str(arr))
    mergesort(arr, 0, len(arr) - 1)

Terminal Output:

Initial Array: [9, 4, 8, 3, 1, 2, 5]
Left: [9]
Right: [4]
After Merge: [4, 9]
Left: [8]
Right: [3]
After Merge: [3, 8]
Left: [4, 9]
Right: [3, 8]
After Merge: [3, 4, 8, 9]
Left: [1]
Right: [2]
After Merge: [1, 2]
Left: [1, 2]
Right: [5]
After Merge: [1, 2, 5]
Left: [3, 4, 8, 9]
Right: [1, 2, 5]
After Merge: [1, 2, 3, 4, 5, 8, 9]

Quicksort

  • Developed by C.A.R. Hoare in 1959
  • Like Merge Sort, Quicksort is a recursive Divide and Conquer algorithm
  • Standard version is not stable, although stable versions do exist
  • Performance: worst case <m>n^2</m> (rare), average case <m>n log n</m>, best case <m>n log n</m>
  • Memory usage: O(<m>n</m>) in the worst case (variants exist with O(<m>log n</m>) stack usage)
  • In practice it is one of the fastest known sorting algorithms, on average

Quicksort procedure

The main steps in Quick sort are:

  1. Pivot selection: Pick an element, called a “pivot” from the array
  2. Partitioning: reorder the array so that all elements with values < the pivot come before it, while all elements with values ≥ the pivot come after it. After this partitioning, the pivot is in its final position.
  3. Recursion: apply steps 1 and 2 above recursively to each of the two subarrays

The base case for the recursion is a subarray of length 1 or 0; by definition these cases do not need to be sorted.

Quicksort example

  • On average Quicksort runs in <m>n</m> log <m>n</m> time, but if it consistently chooses bad pivots, its performance degrades to <m>n^2</m>.
  • This happens if the pivot is consistently chosen so that all (or too many of) the elements in the array are < the pivot or > the pivot. (A classic case is when the first or last element is chosen as a pivot and the data is already sorted, or nearly sorted.)
  • Some options for choosing the pivot:
  • Always pick the first element as the pivot.
  • Always pick the last element as the pivot.
  • Pick a random element as the pivot.
  • Pick the median element as the pivot.

Quick Sort Code

quick_sort.py
# Quick Sort
def printArray(arr):
    return (' '.join(str(i) for i in arr))
 
def quicksort(arr, i, j):
    if i < j:
        pos = partition(arr, i, j)
        quicksort(arr, i, pos - 1)
        quicksort(arr, pos + 1, j)
 
def partition(arr, i, j):
    #pivot = arr[j] # pivot on the last item
    pivot = arr[int(j/2)] # pivot on the median
    small = i - 1
    for k in range(i, j):
        if arr[k] <= pivot:
            small += 1
            swap(arr, k, small)
 
    swap(arr, j, small + 1)
    print ("Pivot = " + str(arr[small + 1]), " Arr = " + printArray(arr))
    return small + 1
 
def swap(arr, i, j):
    arr[i], arr[j] = arr[j], arr[i]
 
if __name__ == '__main__':
    arr = [9, 4, 8, 3, 1, 2, 5]
    print (" Initial Array :", printArray(arr))
    quicksort(arr, 0, len(arr) - 1)
 Initial Array : 9 4 8 3 1 2 5
Pivot = 5  Arr = 4 3 1 2 5 9 8
Pivot = 2  Arr = 1 2 4 3 5 9 8
Pivot = 3  Arr = 1 2 3 4 5 9 8
Pivot = 8  Arr = 1 2 3 4 5 8 9

Non-comparison Sorts

  • “Comparison sorts” make no assumptions about the data and compare all elements against each other (the majority of sorting algorithms work in this way, including all sorting algorithms which we have discussed so far).
  • O(<m>n</m> log <m>n</m>) time is the ideal “worst-case” scenario for a comparison-based sort (i.e. O(<m>n</m> log <m>n</m>) is the smallest penalty you can hope for in the worst case). Heapsort has this behaviour.
  • <m>O(n)</m> time is possible if we make assumptions about the data and don't need to compare elements against each other (i.e., we know that data falls into a certain range or has some distribution).
  • Examples of non-comparison sorts include Counting Sort, Bucket Sort and Radix Sort.
  • <m>O(n)</m> is clearly the minimum sorting time possible, since we must examine every element at least once (how can you sort an item you do not even examine?).

Counting Sort

  • Proposed by Harold H. Seward in 1954.
  • Counting Sort allows us to do something which seems impossible - sort a collection of items in (close to) linear time.
  • How is this possible? Several assumptions must be made about the types of input instances which the algorithm will have to handle.
  • i.e. assume an input of size <m>n</m>, where each item has a non-negative integer key, with a range of k (if using zero-indexing, the keys are in the range [0,…,k-1])
  • Best-, worst- and average-case time complexity of <m>n + k</m>; space complexity is also <m>n + k</m>
  • The potential running time advantage comes at the cost of having an algorithm which is not as widely applicable as comparison sorts.
  • Counting sort is stable (if implemented in the correct way!)

Counting Sort procedure

  • Determine the key range k in the input array (if not already known)
  • Initialise an array count of size k, which will be used to count the number of times that each key value appears in the input instance.
  • Initialise an array result of size n, which will be used to store the sorted output.
  • Iterate through the input array, and record the number of times each distinct key value occurs in the input instance.
  • Construct the sorted result array, based on the histogram of key frequencies stored in count. Refer to the ordering of keys in input to ensure that stability is preserved.

Counting Sort example

Counting Sort Code

counting_sort.py
# Counting sort
def printArray(arr):
    return(' '.join(str(i) for i in arr))
 
 
def countingsort(arr):
    count = [0] * 11 # can store the count of positive numbers <= 10
    N = len(arr)
    for i in range(0, N):
        count[arr[i]] += 1
    for i in range(1, len(count)):
        count[i] += count[i - 1]
    print ("Counting Array :", printArray(count))
    output = [0] * N
    for i in range(len(arr)):
        output[count[arr[i]] - 1] = arr[i]
        count[arr[i]] -= 1
    print ("After Sorting :", printArray(output))
 
if __name__ == '__main__':
    arr = [10, 7, 3, 1, 9, 7, 4, 3]
    print ("Initial Array :", printArray(arr))
    countingsort(arr)
Initial Array   : 10 7 3 1 9 7 4 3
Counting Array  : 0 1 1 3 4 4 4 6 6 7 8
After Sorting   : 1 3 3 4 7 7 9 10

Bucket Sort

  • Bucket sort is a stable sort which works by distributing the elements of an array into a series of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sort algorithm.
  • Bucket sort can be seen as a generalisation of counting sort; in fact, if each bucket has size 1 then bucket sort degenerates to counting sort.
  • Time complexity is <m>n^2</m> in the worst case, and <m>n + k</m> in the best and average cases (where k is the number of buckets)
  • Worst case space complexity is O(<m>n + k</m>)
  • Bucket sort is useful when input values are uniformly distributed over a range, e.g. when sorting a large set of floating point numbers whose values are uniformly distributed between 0.0 and 1.0
  • Bucket Sort's performance degrades with clustering; if many values occur close together, they will all fall into a single bucket and be sorted slowly.

Bucket Sort procedure

  • Set up an array of “buckets”, which are initially empty
  • Iterate through the input array, placing each element into its correct bucket
  • Sort each non-empty bucket (using either a recursive call to bucket sort, or a different sorting algorithm, e.g. Insertion Sort)
  • Visit the buckets in order, and place each element back into its correct position (a minimal sketch follows below)
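A minimal bucket sort sketch, assuming the input values are floats uniformly distributed in [0.0, 1.0); the bucket count and the use of Python's built-in sort for each bucket are illustrative choices:

# Minimal bucket sort sketch for values uniformly distributed in [0.0, 1.0).
def bucket_sort(arr, k=10):
    buckets = [[] for _ in range(k)]      # set up k empty buckets
    for x in arr:
        buckets[int(x * k)].append(x)     # place each element in its bucket
    result = []
    for b in buckets:
        result.extend(sorted(b))          # sort each bucket, then visit buckets in order
    return result
 
print(bucket_sort([0.42, 0.32, 0.33, 0.52, 0.37, 0.47, 0.51]))
# prints [0.32, 0.33, 0.37, 0.42, 0.47, 0.51, 0.52]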

Bucket Sort example


Hybrid Sorting Algorithms

  • A hybrid algorithm is one which combines two or more algorithms which are designed to solve the same problem.
  • Either chooses one specific algorithm depending on the data and execution conditions, or switches between different algorithms according to some rule set.
  • Hybrid algorithms aim to combine the desired features of each constituent algorithm, to achieve a better algorithm in aggregate.
  • E.g. the best versions of Quicksort perform better than either Heap Sort or Merge Sort on the vast majority of inputs. However, Quicksort has poor worst-case running time (O(<m>n^2</m>)) and <m>O(n)</m> stack usage. By comparison, both Heap Sort and Merge Sort have O(<m>n log n</m>) worst-case running time, together with a stack usage of O(1) for Heap Sort or <m>O(n)</m> for Merge Sort. Furthermore, Insertion Sort performs better than any of these algorithms on small data sets.

Introsort

  • Hybrid sorting algorithm proposed by David Musser in 1997.
  • Variation of Quicksort which monitors the recursive depth of the standard Quicksort algorithm to ensure efficient processing.
  • If the depth of the quicksort recursion exceeds <m>log n</m> levels, then Introsort switches to Heap sort instead.
  • Since both algorithms which it uses are comparison-based, IntroSort is also comparison-based.
  • Fast average- and worst-case performance, i.e. <m>n log n</m> (a rough sketch of the idea follows below)
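A rough, not-in-place sketch of the idea only (this is not Musser's exact algorithm): quicksort with a depth limit of about 2·log2 n, falling back to a heap-based sort when the limit is reached. The names and the heapq-based fallback are assumptions for illustration.

import math
import heapq
 
# Rough sketch of Introsort's idea: Quicksort until the recursion depth limit
# (about 2*log2 n) is exceeded, then switch to a heap-based sort.
def introsort(arr):
    maxdepth = 2 * int(math.log2(len(arr))) if arr else 0
    return _sort(list(arr), maxdepth)
 
def _sort(arr, maxdepth):
    if len(arr) <= 1:
        return arr
    if maxdepth == 0:
        return heapsort(arr)              # depth limit reached: fall back to heapsort
    pivot = arr[len(arr) // 2]
    left  = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return _sort(left, maxdepth - 1) + equal + _sort(right, maxdepth - 1)
 
def heapsort(arr):
    heap = list(arr)
    heapq.heapify(heap)                   # build a binary min-heap
    return [heapq.heappop(heap) for _ in range(len(heap))]
 
print(introsort([9, 4, 8, 3, 1, 2, 5]))   # prints [1, 2, 3, 4, 5, 8, 9]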

Timsort

  • Hybrid sorting algorithm implemented by Tim Peters in 2002 for use in the Python language.
  • Derived from a combination of Merge Sort and Insertion Sort, along with additional logic (including binary search)
  • Finds subsequences (runs) of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently, by merging an identified run with existing runs until certain criteria are fulfilled.
  • Used on the Android platform, in Python (since 2.3), for arrays of non-primitive type in Java SE 7, and in the GNU Octave software.

Criteria for choosing a sorting algorithm

Criteria | Sorting algorithm
Small number of items to be sorted | Insertion Sort
Items are mostly sorted already | Insertion Sort
Concerned about worst-case scenarios | Heap Sort
Interested in a good average-case behaviour | Quicksort
Items are drawn from a uniform dense universe | Bucket Sort
Desire to write as little code as possible | Insertion Sort
Stable sorting required | Merge Sort

Conclusion

  • As we have seen, there are many different sorting algorithms, each of which has its own specific strengths and weaknesses.
  • Comparison-based sorts are the most widely applicable, but are limited to <m>n log n</m> running time in the average and worst cases
  • Non-Comparison sorts can achieve linear <m>n</m> running time in the best case, but are less flexible
  • Hybrid sorting algorithms allow us to leverage the strengths of two or more algorithms (e.g. Timsort = Merge sort + insertion sort)
  • There is no single algorithm which is best for all input instances; therefore it is important to use what you know about the expected input when choosing an algorithm.

Week 11 - Searching Algorithms

Overview

  • The Search Problem
  • Performance of Search Algorithms
  • Types of Search Algorithms
  • Linear Search (& how to beat it)
  • Binary Search

The Search Problem

  • Searching is a fundamental operation in computing.
  • E.g. finding information on the web, looking up a bank account balance, finding a file or application on a PC, querying a database…
  • The Search Problem relates to retrieving information from a data structure (e.g. Arrays, Linked Lists, Search Trees, Hash Tables etc.).
  • Algorithms which solve the Search Problem are called Search Algorithms.
  • Includes algorithms which query a data structure, such as the SQL SELECT command.
  • Two fundamental queries about any collection C:
    • Existence: Does C contain a target element? Given a collection C, we often simply want to know whether the collection already contains a given element t. The response to such a query is true if an element exists in the collection that matches the desired target t, or false if this is not the case.
    • Associative lookup: Return information associated in collection C with a target key value k. A key is usually associated with a complex structure called a value. The lookup retrieves or replaces this value.
  • A correct Search Algorithm should return true or the index of the requested key if it exists in the collection, or -1/null/false if it does not exist in the collection.

Reference: Pollice G., Selkow S. and Heineman G. (2016). Algorithms in a Nutshell, 2nd Edition. O' Reilly.

Performance of Search Algorithms

  • Ultimately the performance of a Search Algorithm is based on how many operations are performed (how many elements the algorithm inspects) as it processes a query (as we saw with Sorting Algorithms).
  • Linear <m>O(n)</m> time is the worst case for any (reasonable) searching algorithm – it corresponds to inspecting each element in a collection exactly once.
  • As we will see, sorting a collection of items according to a comparator function (some definition of “less than”) can improve the performance of search queries.
  • However, there are costs associated with maintaining a sorted collection, especially for use cases where frequent insertion or removal of keys/elements is expected.
  • Trade-off between cost of maintaining a sorted collection, and the increased performance which is possible because of maintaining sorted data.
  • Worth pre-sorting the collection if it will be searched often.

Pre-sorted data

  • Several search operations become trivial when we can assume that a collection of items is sorted.
  • E.g. Consider a set of economic data, such as the salary paid to all employees of a company. Values such as the minimum, maximum, median and quartile salaries may be retrieved from the data set in constant <m>O(1)</m> time if the data is sorted.
  • Can also apply more advanced Search Algorithms when data is pre-sorted (a small illustration follows below).
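A small illustration of this point, assuming the salary data has already been sorted in ascending order; the figures and the simplified quartile indices (no interpolation) are invented for illustration:

# Once the data is sorted, order statistics can be read off by index in O(1) time.
# (Quartile indices are simplified here; no interpolation between values.)
salaries = sorted([38000, 52000, 29000, 61000, 45000, 33000, 57000])
 
minimum = salaries[0]
maximum = salaries[-1]
median  = salaries[len(salaries) // 2]
q1      = salaries[len(salaries) // 4]
q3      = salaries[(3 * len(salaries)) // 4]
print(minimum, q1, median, q3, maximum)   # prints 29000 33000 45000 57000 61000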

Types of Search Algorithms

  • Linear: simple (naïve) search algorithm
  • Binary: better performance than Linear
  • Comparison: eliminate records based on comparisons of record keys
  • Digital: based on the properties of digits in record keys
  • Hash-based: maps keys to records based on a hashing function

As we saw with Sorting Algorithms, there is no universal best algorithm for every possible use case. The most appropriate search algorithm often depends on the data structure being searched, but also on any a priori knowledge about the data.

Note: only Linear Search and Binary Search are covered in this module.

  • Linear Search (also known as Sequential Search) is the simplest Searching Algorithm; it is trivial to implement.
  • Brute-force approach to locate a single target value in a collection.
  • Begins at the first index, and inspects each element in turn until it finds the target value, or it has inspected all elements in the collection.
  • It is an appropriate algorithm to use when we do not know whether input data is sorted beforehand, or when the collection to be searched is small.
  • Linear Search does not make any assumptions about the contents of a collection; it places the fewest restrictions on the type of elements which may be searched.
  • The only requirement is the presence of a match function (some definition of “equals”) to determine whether the target element being searched for matches an element in the collection.
  • Perhaps you frequently need to locate an element in a collection that may or may not be ordered.
  • With no further knowledge about the information that might be in the collection, Linear Search gets the job done in a brute-force manner.
  • It is the only search algorithm which may be used if the collection is accessible only through an iterator.
  • Constant O(1) time in the best case, linear O(𝑛) time in the average and worst cases. Assuming that it is equally likely that the target value can be found at each array position, if the size of the input collection is doubled, the running time should also approximately double.
  • Best case occurs when the element sought is at the first index, the worst case occurs when the element sought is at the last index, and the average case occurs when the element sought is somewhere in the middle.
  • Constant space complexity.

Linear Search examples

Search the array a for the target value 7:

  1. a[0] = 2; this is not the target value, so continue
  2. a[1] = 1; this is not the target value, so continue
  3. a[2] = 5; this is not the target value, so continue
  4. a[3] = 4; this is not the target value, so continue
  5. a[4] = 8; this is not the target value, so continue
  6. a[5] = 7; this is the target value, so return index 5

Search the array a for the target value 3:

  1. a[0] = 2; this is not the target value, so continue
  2. a[1] = 1; this is not the target value, so continue
  3. a[2] = 5; this is not the target value, so continue
  4. a[3] = 4; this is not the target value, so continue
  5. a[4] = 8; this is not the target value, so continue
  6. a[5] = 7; this is not the target value, so continue
  7. a[6] = 9; this is not the target value, so continue
  8. Iteration is complete. Target value was not found so return -1

Linear Search in code

linear_search.py
# Searching for an element in a list/array in Python
# can simply be done using the 'in' operator
# Example:
# if x in arr:
#     print(arr.index(x))

# If you want to implement Linear Search in Python:

# Linearly search x in arr[]
# If x is present then return its index,
# else return -1

def search(arr, x):
    for i in range(len(arr)):
        if arr[i] == x:
            return i
    return -1
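
As a quick check, the search function above can be run against the worked examples; the array below is reconstructed from those examples, and the calls are purely illustrative:

# array values taken from the worked Linear Search examples above
a = [2, 1, 5, 4, 8, 7, 9]

print(search(a, 7))   # prints 5  (target found at index 5)
print(search(a, 3))   # prints -1 (target not present)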
  • Linear Search has linear time complexity:
    • Time <m>n</m> if the item is not found
    • Time <m>n/2</m>, on average, if the item is found
  • If the array is sorted, faster searching is possible
  • How do we look up a name in a phone book, or a word in a dictionary?
    • Look somewhere in the middle
    • Compare what’s there with the target value that you are looking for
    • Decide which half of the remaining entries to look at
    • Repeat until you find the correct place
    • This is the Binary Search Algorithm
  • Better than linear running time is achievable only if the data is sorted in some way (i.e. for any two index positions i and j, if i < j then a[i] ≤ a[j]).
  • Binary Search delivers better performance than Linear Search because it starts with a collection whose elements are already sorted.
  • Binary Search divides the sorted collection in half until the target value is found, or until it determines that the target value does not exist in the collection.
  • Binary Search divides the collection approximately in half using whole integer division (e.g. 7/4 = 1, where the remainder is disregarded).
  • Binary Search has logarithmic time complexity and constant space complexity.

Binary Search examples

Search the array a for the target value 7:

  1. Middle index =(0+6)/2 = 3. a[3] = 6 < 7, so search in the right subarray (indices left=4 to right=6)
  2. Middle index =(4+6)/2 = 5. a[5] = 8 > 7, so search in the left subarray (indices left=4 to right=4)
  3. Middle index =(4+4)/2 = 4. a[4] = 7 = 7, so return index 4

Search the array a for the target value 3:

  1. Middle index =(0+6)/2 = 3. a[3] = 6 > 3, so search next in the left subarray (indices left=0 to right=2)
  2. Middle index =(0+2)/2 = 1. a[1] = 4 > 3, so search next in the left subarray (indices left=0 to right=0)
  3. Middle index =(0+0)/2 = 0. a[0] = 2 < 3, so the search would continue in the right subarray, but left (1) is now greater than right (0)
  4. The search space is empty; this corresponds to the “not found” base case, so return -1

Iterative Binary Search in Code

iterative_binary_search.py
# Iterative Binary Search Function
# Returns the location of x in the given array arr if present,
# else returns -1
def binarySearch(arr, l, r, x):

    while l <= r:

        # integer division so that mid is a valid index
        mid = l + (r - l) // 2

        # Check if x is present at mid
        if arr[mid] == x:
            return mid

        # If x is greater, ignore the left half
        elif arr[mid] < x:
            l = mid + 1

        # If x is smaller, ignore the right half
        else:
            r = mid - 1

    # If we reach here, the element was not present
    return -1


# Test array
arr = [2, 3, 4, 10, 40]
x = 10

# Function call
result = binarySearch(arr, 0, len(arr) - 1, x)

if result != -1:
    print("Element is present at index %d" % result)
else:
    print("Element is not present in array")

Recursive Binary Search in Code

recursive_binary_search.py
# Python program for recursive binary search.

# Returns index of x in arr if present, else -1
def binarySearch(arr, l, r, x):

    # Check base case
    if r >= l:

        # integer division so that mid is a valid index
        mid = l + (r - l) // 2

        # If the element is present at the middle itself
        if arr[mid] == x:
            return mid

        # If x is smaller than the middle element, then it can only
        # be present in the left subarray
        elif arr[mid] > x:
            return binarySearch(arr, l, mid - 1, x)

        # Else the element can only be present in the right subarray
        else:
            return binarySearch(arr, mid + 1, r, x)

    else:
        # Element is not present in the array
        return -1

# Test array
arr = [2, 3, 4, 10, 40]
x = 10

# Function call
result = binarySearch(arr, 0, len(arr) - 1, x)

if result != -1:
    print("Element is present at index %d" % result)
else:
    print("Element is not present in array")
  • In Binary Search, we choose an index that cuts the remaining portion of the array in half
  • We repeat this until we either find the value we are looking for, or we reach a subarray of size 1
  • If we start with an array of size <m>n</m>, we can cut it in half <m>log_2 n</m> times
  • Hence, Binary Search has logarithmic <m>(log n)</m> time complexity in the worst and average cases, and constant time complexity in the best case
  • For an array of size 1000, this is approx. 100 times faster than Linear Search (<m>2^{10}</m> ≈ 1000 when neglecting constant factors), as the short sketch below illustrates
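
The claimed speedups can be checked with a few lines of Python. This is a minimal sketch (not part of the module code); it simply evaluates <m>n / log_2 n</m> for the two array sizes mentioned above:

import math

# Binary Search needs at most about log2(n) comparisons,
# while Linear Search needs on the order of n comparisons
for n in (1000, 1000000):
    binary_steps = math.ceil(math.log2(n))   # halvings until one element remains
    speedup = n / binary_steps               # factor by which Binary Search wins
    print(n, binary_steps, round(speedup))

# 1000    -> 10 halvings, roughly a 100x speedup
# 1000000 -> 20 halvings, roughly a 50 000x speedup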

Conclusion

  • Linear Search has linear time complexity
  • Binary Search has logarithmic time complexity
  • For large arrays, Binary Search is far more efficient than Linear Search
  • However, Binary Search requires that the array is sorted
  • If the array is sorted, Binary Search is
    • Approx. 100 times faster for an array of size 1000 (neglecting constants)
    • Approx. 50 000 times faster for an array of size 1 000 000 (neglecting constants)
  • Significant improvements like these are what make studying and analysing algorithms worthwhile

Week 12: Benchmarking

Benchmarking Algorithms in Python

Overview

  • Motivation for benchmarking
  • Time in Python
  • Benchmarking a single run
  • Benchmarking multiple statistical runs

Motivation for Benchmarking

  • Benchmarking, or a posteriori analysis, is an empirical method for comparing the relative performance of algorithm implementations.
  • Experimental data (e.g. running times) may be used to validate theoretical or a priori analysis of algorithms.
  • Various hardware and software factors, such as system architecture, CPU design, choice of operating system, background processes, and energy-saving or performance-enhancing technologies, can affect running time.
  • Therefore it is prudent to conduct multiple statistical runs using the same experimental setup, to ensure that your set of benchmarks is representative of the performance expected by an “average” user.

Time in Python

  • Dates and times in Python are represented as the number of seconds that have elapsed since midnight on January 1st 1970 (the “Unix epoch”)
  • Each second since the Unix epoch has a specific timestamp
  • Can import the time module in Python to work with dates and times
    • e.g. start_time = time.time() # gets the current time in seconds
    • e.g. 1555001605 is Thursday, 11 April 2019 16:53:25 GMT (see the short sketch below)
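
A minimal sketch of working with timestamps via the standard time module (the first value printed depends on when the code is run):

import time

# current time as seconds since the Unix epoch (a float)
start_time = time.time()
print(start_time)

# convert a known timestamp back into a human-readable UTC date
# 1555001605 corresponds to Thursday, 11 April 2019 16:53:25 GMT
print(time.strftime("%A, %d %B %Y %H:%M:%S", time.gmtime(1555001605)))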

Benchmarking a single run

import time
 
#log the start time in seconds
start_time = time.time()
 
#call the function you want to be benchmarked
 
#log the end time in seconds
end_time = time.time()
 
#calculate the elapsed time
time_elapsed = end_time - start_time
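
Note that time.time() measures wall-clock time and its resolution varies between platforms; for short-running functions, Python 3 also provides time.perf_counter(), which is designed for this kind of measurement. A minimal variant of the snippet above:

import time

#log the start time using the high-resolution performance counter
start_time = time.perf_counter()

#call the function you want to be benchmarked

#log the end time
end_time = time.perf_counter()

#calculate the elapsed time
time_elapsed = end_time - start_time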

Benchmarking multiple statistical runs

import time
 
num_runs = 10 # number of times to test the function
results = [] # array to store the results for each test
 
# benchmark the function
for r in range(num_runs):
   # log the start time in seconds
   start_time = time.time()
   # call the function that you want to benchmark
 
   # log the end time in seconds
   end_time = time.time()
 
   # calculate the elapsed time
   time_elapsed = end_time - start_time
 
   results.append(time_elapsed)
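
Once the individual timings have been collected, they can be summarised; a minimal sketch using the standard statistics module (the results list is the one built up in the loop above):

import statistics

mean_time = statistics.mean(results)    # average running time over the runs
std_dev = statistics.stdev(results)     # spread of the measurements

print("mean: %.6f s, standard deviation: %.6f s" % (mean_time, std_dev))

The skeleton below (benchmark.py) applies the same multiple-runs idea to the prescribed array sizes and sort functions.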
benchmark.py
from numpy.random import randint
from time import time

def Bubble_Sort(array):
    # your sort code
    sorted_array = array
    return sorted_array

def Merge_Sort(array):
    # your sort code
    sorted_array = array
    return sorted_array

def Counting_Sort(array):
    # your sort code
    sorted_array = array
    return sorted_array

def Quick_Sort(array):
    # your sort code
    sorted_array = array
    return sorted_array

def Timsort(array):
    # your sort code
    sorted_array = array
    return sorted_array

# The actual sort functions defined in your code; note there are no quotes (these are function references)
sort_functions_list = [Bubble_Sort, Merge_Sort, Counting_Sort, Quick_Sort, Timsort]

# The prescribed array sizes to use in the benchmarking tests
for ArraySize in (100, 250, 500, 750, 1000, 1250, 2500, 3750, 5000, 6250, 7500, 8750, 10000):
    # create a random array of values to use in the tests below
    testArray = randint(1, ArraySize * 2, ArraySize)
    # iterate through the list of sort functions to call in the test
    for sort_type in sort_functions_list:
        # and test every function ten times
        for i in range(10):
            start = time()
            sort_type(testArray)
            stop = time()
            # print: function name, array size, run number, elapsed time in milliseconds
            print('{} {} {} {}'.format(sort_type.__name__, ArraySize, i, (stop - start) * 1000))
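
One caveat worth noting (this is an assumption about how the sort functions will eventually be implemented, not something the skeleton enforces): if any of the sorts operate in place, every repetition after the first would receive an already-sorted array and report misleadingly fast times. Passing a fresh copy on every call avoids this:

# inside the innermost loop: time a copy so every run sees unsorted data
start = time()
sort_type(testArray.copy())   # numpy arrays support .copy()
stop = time()

The cost of the copy is then included in every measurement, which is acceptable as long as it is the same for all algorithms being compared.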

Useful References

Python 3 time documentation: https://docs.python.org/3/library/time.html

Python Date and Time Overview https://www.tutorialspoint.com/python/python_date_time.htm

Discussions of benchmarking issues in Python (advanced material) https://www.peterbe.com/plog/how-to-do-performance-micro-benchmarks-in-python
