Last Updated: 8th February 2019
I recently came across a fun challenge on www.hackerrank.com under the graph theory category. In this post, I will break down and discuss a solution that I designed and implemented in Python. This solution will pass all of the unit tests and will receive the maximum score.
The style of code that I use was inspired by the book Algorithms by Robert Sedgewick and Kevin Wayne. They encourage object orientation and clean code. A agree with them that there is no reason not to use a clean coding style when solving algorithmic problems.
My intention in this post is to practice programming fundamentals with a focus on the process of problem solving and readable code. I consider these very important no matter what level you are at in your career as a programmer.
Before reading further with this editorial, I recommend that you go to the hackerrank website and try to answer the question yourself.
I like to start solving programming problems with the end in mind, and work backwards. I also regularly use my favourite problem solving technique: wishful thinking. With this technique, I write my code assuming certain classes and functions already exist. This allows me to structure my code in a way that makes it easy to understand and reason about. After I am convinced that the steps I have broken the problem down into are at an appropriate level of abstraction and are correct, I proceed to implement the missing classes and functions that I have wished into existence.
Using these two approaches, the problem can be broken down into three sub-problems:
Calculate the number of pairs: This will be the result which intuitively feels like a combinatorial calculation, which will require the number of astronauts for each nationality as input.
Get the number of astronauts of each nationality: To get the number of astronauts for each nationality for step 1, I have no choice but to use the input data which consists of ID’s and associations between those ID’s. If I imagine the pairs all connected to each other as given in the input, astronauts of each nationality will form separate networks or put another way disjoint subsets. Therefore some form of graph traversal will be necessary to count the sizes of the connected regions of the networks that correspond to each of the nationalities.
Construct a graph data structure: Since I expect to perform a graph traversal, I will need to transform the input data into a graph data structure that implements an appropriate API. The API should allow the processing I require in step 2 in a way that hides the implementation of the graph data structure. I also know that the type of graph should be of the undirected type. This is because
Hackerrank supplies a blank function definition which I choose to populate as follows:
def journeyToMoon(n, astronaut):
'''
Function signature supplied by Hackerrank,
and my implementation using supplied arguments: n, astronaut.
'''
graph = UndirectedGraph(n, astronaut)
subset_sizes = get_disjoint_subset_sizes(graph)
return number_of_pairs(subset_sizes)
Here I am assuming the existence of the UndirectedGraph
class, and the two functions get_disjoint_subset_sizes
and number_of_pairs
. The problem is broken down into three nice chunks that can potentially be solved independently. Each part also has a clearly defined task to perform which makes it very easy to follow how the data is transformed into the desired result.
Now that that there is a plan, I will proceed with finding a solution to the sub-problems described in the previous section:
I will first consider the function number_of_pairs
and assume I have a valid list of subset_sizes
a, b, c, d, … to pass in as an argument. The list subset_sizes
represents the number of astronauts for each nationality.
I can try to find a solution by solving simplified versions to build up an understanding, and then extend to the full solution once I see a pattern form.
Zero elements in number_of_pairs
is an invalid input so we will first consider a single element a. This means that the input consists of only a astronauts of a single nationality A. The result of the calculation should be 0 because it is not possible to make a pair of astronauts consisting of two nationalities when there only exists one nationality.
Things start to get interesting with more than one nationality. Consider two nationalities, one with a members and the second with b. The number of pairs that can be created will be: ab because each member of nationality A can form a pair with every member of B and therefore the sum of all the pairs is the product.
If we add a third nationality with c members, the total number of pairs will consist of the number of pairs for two nationalities as they will form pairs anyway, plus the pairs that can be formed by nationality C with everyone else. The total will therefore be: ab + (a + b)c
If we add a fourth, the number of pairs will be the number of pairs of three which we already know the answer to, plus the number of pairs the fourth nationality will form with everyone else: ab + (a + b)c + (a + b + c)d
There is a pattern already and looks like the number of pairs for an arbitrary number of nationalities is:
ab + (a + b)c + (a + b + c)d + (a + b + c + d)e + ...
where a, b, c, d, e, ... are the number of members of each unique nationality.
This leads to the intuition that each time the size of the problem is increased by adding a new nationality, the new members would have to form pairs with everyone else, and everyone else will form pairs with each other. Therefore the total number of pairs is the sum of the two. This leads to an inductive hypothesis where given the number of pairs for n nationalities, the number of pairs for n + 1 nationalities is determined by a simple summation.
The following function will perform this computation just like how the induction hypothesis is defined.
def number_of_pairs(subset_counts):
'''
A recursive implementation to calculate the number of pairs.
'''
if type(subset_counts) is not list:
raise TypeError("subset_counts must be a list of integers")
if len(subset_counts) <= 1:
return 0
first_element = subset_counts[0]
remaining_elements = subset_counts[1:]
return number_of_pairs(remaining_elements) + sum(remaining_elements) * first_element
This code will give the correct answer for small input sizes but it will cause the certain Hackerrank unit tests to fail due to the recursion exceeding the stack space for larger inputs.
The bottleneck for this algorithm is therefore the recursion which can be eliminated by performing the same operations using iteration.
There is also duplication of work with the summation over the remaining_elements
array which can be eliminated by strengthening the induction hypothesis with the sum already given. This will be done by building it up while the computation proceeds.
Incorporating these two observations gives the following code which pass the Hackerrank unit tests:
def number_of_pairs(subset_counts):
'''
An iterative implementation to calculate the number of pairs.
'''
if type(subset_counts) is not list:
raise TypeError("subset_counts must be a list of integers")
if len(subset_counts) <= 0:
return 0
result = 0
summation_so_far = subset_counts[0]
for count in subset_counts[1:]:
result += summation_so_far * count
summation_so_far += count
return result
An issue that needs to be taken care of is how to traverse the whole of a graph that consists of disconnected parts. Each disconnected part corresponds to a different nationality.
In the following code I’m using wishfull thinking again by assuming I have a graph that is iterable. This function iterates over all of the vertices of the graph and obtains the number of vertices in the disconnected graph that each vertex belongs to. The results are appended to a list and returned.
def disjoint_subset_sizes(graph):
'''
A function to obtain the disjoint subset sizes from a graph.
'''
result = []
for vertex in graph:
subset_counter = subset_size(graph, vertex)
if subset_counter > 0:
result.append(subset_counter)
return result
As already mentioned, a graph traversal will be necessary, which will be performed by the function subset_size
. While the traversal proceeds, the number of connected vertices will be counted. There are two main ways to perform this action. They are called Breadth First Search (BFS) and Depth First Search (DFS). Both of these algorithms allow every node within a connected graph to be visited. The difference is the order in which it is done. I choose to implement DFS because it has an intuitive recursive implementation:
def subset_size(graph, vertex, counter = 0):
'''
A recursive implementation of DFS to count the number of elements in the disjoint subset which 'vertex' belongs to.
'''
if not graph.marked[vertex]:
graph.marked[vertex] = True
counter += 1
for adjacent_vertex in graph.adj(vertex):
counter = subset_size(graph, adjacent_vertex, counter)
return counter
This algorithm keeps track of the number of vertices visited along the traversal, and avoids revisiting the same vertex by “marking” them in the marked
array with a boolean value. If a vertex is not already visited then it will obtain the list of adjacent vertices via the public method adj
, and recursively calls itself on each of those until no more unvisited vertices remain. Finally the number of visited vertices is returned.
This function will suffer the same bottleneck as number_of_pairs
due to the recursion. It is useful to have a recursive version however. It is easy to write and will serve as something to build upon and compare with when writing the slightly more complex iterative version. There a stack is used to simulate the call stack in order to eliminate the recursion:
def subset_size(graph, vertex):
'''
An iterative implementation of DFS to count the number of elements in the disjoint subset which 'vertex' belongs to.
'''
stack = [vertex]
counter = 0
if graph.marked[vertex]:
return counter
while len(stack) != 0:
current_vertex = stack.pop()
if not graph.marked[current_vertex]:
graph.marked[current_vertex] = True
counter += 1
for adjacent_vertex in graph.adj(current_vertex):
stack.append(adjacent_vertex)
return counter
All that remains is a Graph data structure to represent the input. The functions in the previous sections have all been written assuming a very simple API consisting of only two public methods, a constructor to create and initialise my graph, and the adj
method to return a list of neighbours to a given vertex. My implementation of the Graph data structure also uses the [iterator pattern]({{< ref “python-iterators.md” >}}). The implementation of the iterator object AdjacencyListGraphIterator
has been omitted.
class UndirectedGraph:
'''
A class representing an undirected graph.
Each vertex is represented by an integer from zero to the number of vertices minus one.
'''
def __init__(self, num_vertices, edges):
if num_vertices is None:
raise TypeError("Number of vertices cannot be None")
if edges is None:
raise TypeError("List of edges cannot be None")
self.num_vertices = num_vertices
self._graph = [[] for _ in range(self.num_vertices)]
self.marked = [False] * self.num_vertices
try:
for edge in edges:
self.__add_undirected_edge(edge)
except Exception as e:
print(e)
def __iter__(self):
return AdjacencyListGraphIterator(self.num_vertices)
def __add_undirected_edge(self, edge):
if edge is None:
raise TypeError("Added edge cannot be None")
v, w = edge
if v is None or w is None:
raise TypeError("Added vertex cannot be None")
if w not in self._graph[v]:
self._graph[v].append(w)
if v not in self._graph[w]:
self._graph[w].append(v)
def adj(self, vertex):
'''
Return the list of adjacent vertices to the given vertex.
'''
return self._graph[vertex]
I have tried my best to break down the problem and solve the parts in an intuitive way. This is not the only way to solve the problem or probably the most efficient, but I have aimed to make it as correct as possible. Being readable and therefore easy to reason about makes it more likely to be correct.
View source code.