My Expression: Alien Dictionary

This is article is for self study

There is a new alien language that uses the English alphabet. However, the order among the letters is unknown to you.

You are given a list of strings words from the alien language's dictionary, where the strings in words are sorted lexicographically by the rules of this new language.

Return a string of the unique letters in the new alien language sorted in lexicographically increasing order by the new language's rules. If there is no solution, return "". If there are multiple solutions, return any of them.

A string s is lexicographically smaller than a string t if at the first letter where they differ, the letter in s comes before the letter in t in the alien language. If the first min(s.length, t.length) letters are the same, then s is smaller if and only if s.length < t.length.

Example 1:

Input: words = ["wrt","wrf","er","ett","rftt"]
Output: "wertf"

Example 2:

Input: words = ["z","x"]
Output: "zx"

Example 3:

Input: words = ["z","x","z"]
Output: ""
Explanation: The order is invalid, so return "".

Constraints:

1 <= words.length <= 100
1 <= words[i].length <= 100
words[i] consists of only lowercase English letters.

What is in degree and out degree of a graph?

These terms only make sense in a directed graph.

In degree is number of arrows arriving at a node. (The arrow heads point to the node)

Out degree is number of arrows leaving from a node. (The arrow heads points away from the

node).

The need for an adjacency list

An adjacency matrix is a popular and simple way to represent a graph. However, it is most suitable when your model does not frequently manipulate vertices. Adjacency matrices are good for storing dense graphs, but in most of the other cases, an adjacency list is better than an adjacency matrix.

Adjacency list

In an adjacency list, we use an array of linked lists to represent a graph instead of a matrix.

For example, consider the graph below.

Refer: https://leetcode.com/problems/alien-dictionary/solution/

There are three parts to this problem.

Getting as much information about the alphabet order as we can out of the input word list.
Representing that information in a meaningful way.
Assembling a valid alphabet ordering.

Part 1: Extracting Information

Let's start with a large example of a dictionary in an "alien language", and see how much we can conclude with some simple reasoning. This is likely to be your first step in tackling this question in a programming interview.

List of alien words: wxqkj, whqg, cckgh, cdxg, cdxdt, cdht, ktgxt, ktgch, ktdw, ktdc, jqw, jmc, jmg

Remember that in an ordinary English dictionary, all the words starting with a are at the start, followed by all the ones starting with b, then c, d, e, and at the very end, z. In the "alien dictionary", we also expect the first letters of each word to be in alphabetical order. So, let's look at them.

First letters of each word: w, w, c, c, c, c, k, k, k, k, j, and j

Removing the duplicates, we get:

First letters of each word without duplicates: w, c, k, and j

Let's take a closer look at the first two words of the "alien dictionary"; wxqkj and whgg. Seeing as the first letters are the same, w, we look at the second letters. The first word has x, and the second word has h. Therefore, we know that x is before h in the alien dictionary. We know have two fragments of the letter-order.

First two sequences, w, c, k, j and x, h

Let's have a look at all the relations we can find by comparing adjacent words.

Relations between words

You might notice that we haven't included some rules, such as w → j. This is fine though, as we can still derive it from the rules we have: w → c, c → k, k → j.

This completes the first part. There is no further information we can extract from the input. Therefore, our task is now to put together what we know.

Part 2: Representing the Relations

We now have a set of relations stating how pairs of letters are ordered relative to each other.

Relations: x → h, w → c, c → d, g → d, c → k, x → c, k → j, q → m, and c → g

How could we put these together? You may be tempted to start trying to build "chains" out of them. Here are a few possible chains.

Some combined chains: w→c→k→j, w→c→d, x→c→k→j, and q→m

The problem here though is that some letters are in more than one chain. Simply putting the chains into the output list one-after-the-other won't work. Some letters will be duplicated, which invalidates the ordering. Simply deleting the duplicates will not work either.

When we have a set of relations, often drawing a graph is the best way to visualize them. The nodes are the letters, and an edge between two letters, A and B represents that A is before B in the "alien alphabet".

Graph with sources highlighted

Part 3: Assembling a Valid Ordering

As we can see from the graph, four of the letters have no arrows going into them. What this means is that there are no letters that have to be before any of these four (remember that the question states there could be multiple valid orderings, and if there are, then it's fine for us to return any of them).

Therefore, a valid start to the ordering we return would be:

First group ordering: q w t x

We can now remove these letters and edges from the graph, because any other letters that required them first will now have this requirement satisfied.

Graph after first step with new sources highlighted

On this new graph, there are now three new letters that have no in-arrows. We can add these to our output list.

Two groups ordering: q w t x, m h c

Again, we can remove these from the graph.

Graph after second step with new sources highlighted

Then add the two new letters with no in-arrows.

Three groups ordering: q w t x, m h c, g k

Which leaves the following graph.

Graph after third step with new sources highlighted

With the final two letters.

All groups ordering: q w t x, m h c, g k, j d

Which is a valid ordering that we can return.

As a side note, each of the four groups of letters we picked off could have been in any order within themselves (as another side note, it's not too difficult to calculate how many valid orderings there are. Have a think about this if you want, determining how many valid alphabet orderings there are is an interesting follow-up question!)

Algorithm

Now that we have come up with an approach, we need to figure out how to implement it efficiently.

The first and second parts are straightforward; we'll leave you to look at the code for these. It should extract the order relations and then insert them into an adjacency list. The only thing we need to be careful of is ensuring that we've handled the "prefix" edge case correctly.

Adjacency list

The third part is more complicated. We need to somehow identify which letters have no incoming links left. With the adjacency list format above, this is a bit annoying to do, because determining whether or not a particular letter has any incoming links requires repeatedly checking over the adjacency lists of all the other letters to see whether or not they feature that letter.

A naïve solution would be to do exactly this. While this would be efficient enough with at most 26 letters, it may result in your interviewer quickly telling you that we might want to use the same algorithm on an "alien language" with millions of unique letters.

An alternative is to keep two adjacency lists; one the same as above, and another that is the reverse, showing the incoming links. Then, each time we traverse an edge, we could remove the corresponding edge in the reverse adjacency list. Seeing when a letter has no more incoming links would now be straightforward.

Reverse adjacency list

However, we can do even better than that. Instead of keeping track of all the other letters that must be before a particular letter, we only need to keep track of how many of them there are! While building the forward adjacency list, we can also count up how many incoming edges each letter has. We call the number of incoming edges the indegree of a node.

Count list

Then, instead of removing an edge from a reverse adjacency list, we can simply decrement the count by 1. Once the count reaches 0, this is equivalent to there being no incoming edges left in the reverse adjacency list.

We'll do a BFS for all letters that are reachable, adding each letter to the output as soon as it's reachable. A letter is reachable once all of the letters that need to be before it have been added to the output. To do a BFS, recall that we use a queue. We should initially put all letters with an in-degree of 0 onto that queue. Each time a letter gets down to an in-degree of 0, it is added to the queue.

We continue this until the queue is empty. After that, we check whether or not all letters were put in the output list. If some are missing, this is because we got to a point where all remaining letters had at least one edge going in; this means there must be a cycle! In that case, we should return "" as per the problem description. Otherwise, we should return the complete ordering we found.

One edge case we need to be careful of is where a word is followed by its own prefix. In these cases, it is impossible to come up with a valid ordering and so we should return "". The best place to detect it is in the loop that compares each adjacent pair of words.

Also, remember that not all letters will necessarily have an edge going into or out of them. These letters can go anywhere in the output. But we need to be careful to not forget about them in our implementation.

Here is an animation showing the entire algorithm on our example from earlier.

public String alienOrder(String[] words) {

    // 1. For unique letters, creating data structure adjacency list map and in-degree map with default values
    Map<Character, List<Character>> adjacencyList = new HashMap<>();
    Map<Character, Integer> inDegree = new HashMap<>();
    for (String word : words) {
        for (Character c : word.toCharArray()) {
            inDegree.put(c, 0);
            adjacencyList.put(c, new LinkedList<>());
        }
    }

    // 2. Populating the data structure
    for (int i = 0; i < words.length - 1; i++) {
        String word1 = words[i];
        String word2 = words[i + 1];

        // if prefix of a word comes after a given word then its not in lexicographical order and just return empty
        if (word1.startsWith(word2) && word1.length() > word2.length()) return "";

        //Populating adjacency list map and in-degree map with correct values by comparing adjacent words.
        // Find the first non match and insert the corresponding relation.
        for (int j = 0; j < Math.min(word1.length(), word2.length()); j++) {
            if (word1.charAt(j) != word2.charAt(j)) {
                inDegree.put(word2.charAt(j), inDegree.get(word2.charAt(j)) + 1);
                adjacencyList.get(word1.charAt(j)).add(word2.charAt(j));
                break;

            }
        }

    }

    //3. BFS - a) adding character with 0 in-degree alien order. b) adding character with 0 in-degree to queue to consider its relative smaller character
    StringBuilder sb = new StringBuilder();
    Queue<Character> queue = new LinkedList<>();
    for (Character c : inDegree.keySet()) {
        if (inDegree.get(c) == 0) {

            queue.offer(c);
        }
    }

    while (!queue.isEmpty()) {
        Character c = queue.poll();
        sb.append(c);
        for (Character next : adjacencyList.get(c)) {
            inDegree.put(next, inDegree.get(next) - 1);
            if (inDegree.get(next) == 0) {
                queue.offer(next);


            }
        }
    }

    if (sb.length() < inDegree.size()) {
        return "";
    }
    return sb.toString();
}

My Expression

Wednesday, September 21, 2022

Alien Dictionary

The need for an adjacency list

Adjacency list

No comments:

Post a Comment