Sunday, September 9, 2012

Generating all permutations, combinations, and power set of a string (or set of numbers)

Combinatorics is a branch of mathematics that deal with counting of discrete structures. Two concepts that often come up in the study of combinatorics are permutaions and combinations of a set of discrete elements.

The number of possible permutations (and combinations) that can be generated from a set of discrete elements can be huge. We will leave the mathematical analysis here and instead focus on how to generate all permutations and combinations of a set of numbers/characters by using a computer program. We will be using the Java programming language to write the code for this.

A related concept is the concept of Power Set. A Power Set of a set of discrete elements is the set of all subsets of elements from the original elements. Essentially this is a set of all combinations of all lengths (0 to size of the set) of the set.

Generating all permutations of a string (of characters):

The key to understanding how we can generate all permutations of a given string is to imagine the string (which is essentially a set of characters) as a complete graph where the nodes are the characters of the string. This basically reduces the permutations generating problem into a graph traversal problem: given a complete graph, visit all nodes of the graph without visiting any node twice. How many different ways are there to traverse such a graph?

It turns out, each different way of traversing this graph is one permutation of the characters in the given string!

We can use Depth First Search (DFS) traversal technique to traverse this graph of characters. The important thing to keep in mind is that we must not visit a node twice in any "branch" of the depth-first tree that runs down from a node at the top of the tree to the leaf which denotes the last node in the current "branch".

              START
           /       |         \
         A       B        C
       /   \      /  \      /   \
     B    C   A  C  A   B
      |      |    |     |   |      |
     C    B  C   A  B    A

In the above figure, a "branch" is the vertical line that connects all 3 characters. A "branch" is one permutation of the given string. In a recursive (DFS-based) solution the trick is to maintain an array that holds one such "branch" at any given time.

Here is the Java code:

void generatePermutations(char[] arr, int size, char[] branch, int level, boolean[] visited)
{
    if (level >= size-1)
    {
        System.out.println(branch);
        return;
    }
   
    for (int i = 0; i < size; i++)
    {
        if (!visited[i])
        {
            branch[++level] = arr[i];
            visited[i] = true;
            generatePermutations(arr, size, branch, level, visited);
            visited[i] = false;
            level--;
        }
    }
}

The above method can be called like this:

String str = "ABCD";
int n = str.length();
char[] arr = str.toCharArray();
boolean[] visited = new boolean[n];
for (int i = 0; i < n; i++)
    visited[i] = false;
char[] branch = new char[n];
generatePermutations(arr, n, branch, -1, visited);

The visited array keeps track of which nodes have been visited already.

Generating combinations of k elements:
Generating combinations of k elements from the given set follows similar algorithm used to generate all permutations, but since we don't want to repeat an a character even in a different order we have to force the recursive calls to not to follow the branches that repeat a set of characters.

If the given string is "ABC" and k = 2, our recursive tree will look like this:

           START
           /        | 
         A        B
       /     \      |
     B      C   C

Here we will have to make sure, once we start a "branch" from a node (character), we must not come back to that node (character) again to start another "branch". So, starting off a new recursive call (to traverse a new "branch") must start from the following node (character)!

Here is the Java code for generating k combinations:

void combine(char[] arr, int k, int startId, char[] branch, int numElem)
{
    if (numElem == k)
    {
        System.out.println(Arrays.toString(branch));
        return;
    }
   
    for (int i = startId; i < arr.length; ++i)
    {
        branch[numElem++] = arr[i];
        combine(arr, k, ++startId, branch, numElem);
        --numElem;
    }
}

In the above code, that variable startId makes sure we are never starting a new recursive call for a new "branch". It gets incremented for a new traversal.

To call the combine method above, do this:

int k = 2;
char[] input = "ABCD".toCharArray();
char[] branch = new char[k];
combine(input, k, 0, branch, 0);

Generating the power set:

Generating all subsets (the power set) of a given set of characters (or numbers) is very similar to generating combinations. While generating k-element combinations our goal was to print the current "branch" only when it holds all k characters from the given string. Since a power set contains combination of all lengths, we will simply call combine to generate k combinations for all k where 0 <= k < SIZE(string).

void powerSet(char[] arr)
{
    for (int i = 0; i < arr.length; ++i)
    {
        char[] branch = new char[i];
        combine(arr, i, 0, branch, 0);
    }   
}

To call the powerSet method, simply pass in the character array we want to construct the power set of.

I hope you now have a good idea of how to generate all permutations, k-element combinations, and the power set of a given set of elements. I used to find these really hard in the beginning. But once I have started thinking these in terms of graph traversal problems, things became much easier!

Graph algorithms rock :)

Monday, February 20, 2012

Life on a Tree - Creating, Copying, and Pruning Tree Data Structures

The tree is one of the most important data structures in computer programming.



A very common of example of the tree data structure being used in everyday life is a directory structure widely used both in Windows and Unix systems.

Today I will write a brief tutorial on trees, and show a number of common operations done on the structure of a tree. The code examples will be in Java.

First, let's create a tree. A tree is essentially a set of node objects that hold some data and each node has zero or more "child" nodes. The nodes without any child nodes are called leaves and the node that itself is not a child of any other node is called the root. You can read up on trees on the Web, we will focus on the implementation of trees here.

We can define a node in the tree as follows:

class Node
{
    private int data; /// this can be more complex objects
  
    private Node left;
    private Node right;
    private Node parent;
  
    public Node(int value)
    {
        data = value; // we could also use setter/getter for this value
        parent = null; // this is optional, but having this is often useful
        left = null;
        right = null;
    }
   
    public Node(Node node) // copy constructor
    {
        data = node.getData();
        left = null;
        right = null;
        parent = null;
    }

    public void setData(int data)
    {
        this.data = data;
    }
   
    public int getData()
    {
        return data;
    }
   
    public void setLeftNode(Node node)
    {
       left = node;
    }

    public Node getLeftNode()
    {
        return left;
    }

    public void setRightNode(Node node)
    {
        right = node;
    }

    public Node getRightNode()
    {
       return right;
    }

    public void setParentNode(Node node)
    {
         parent = node;
    }

    public Node getParentNode()
    {
        return parent;
    }
}

The class above is a simple example of a node in a tree. It contains a data variable, a reference to left and right sub-trees each, and a reference to the parent node as well. There are also a set of setters and getters to manipulate the data members. Please note that - in practice you might need much more complex nodes loaded with big data objects and more children than just a left and right nodes.

To create a tree using the class above you could do the following:

Node root = new Node(100); // creating a root node with value 100

root.setLeftNode(new Node(200));
root.setRightNode(new Node(50));

root.getLeftNode().setLeftNode(new Node(10));
root.getLeftNode().setRightNode(new Node(1000));

A tree is represented by the root node of the tree. So, the "root" object above represent the entire tree.

Copying a tree:

Say you have a tree that you want to copy to another tree. Now, copying can be done in two ways - i) shallow copy, and ii) deep copy.

We will show deep copy here. Deep copying means the new tree is an entirely new copy of the old tree - both in structure and in data. Everything is allocated again in memory.

The following function does a deep copy of a tree to another tree:

public void deepCopy(Node root, Node copy)
{
        Node left = root.getLeftNode();
        Node right = root.getRightNode();
       
        if (left != null)
        {
            copy.setLeftNode(new Node(left));
            deepCopy(left, copy.getLeftNode());
        }
       
        if (right != null)
        {
            copy.setRightNode(new Node(right));
            deepCopy(right, copy.getRightNode());
        }
}

This function can be called like this:

Node copy = new Node(root); // copy the root
deepCopy(root, copy); // copy the rest of the tree

Now, the tree referenced by "copy" holds an entire deep copy of the tree "root".

Pruning a tree:

Pruning means deleting one or more subtrees of a tree. We will implement a filter-based pruning here. That is, whenever a node will match some criteria described in a filter we will delete that node along with all its children from its parent.

First, we will need a way to represent a filter. We will do that by way of an interface that all filter classes will have to implement.

interface Filter
{
    public boolean check(Node node);
}

Now, we can define a Filter class as follows:

class MyFilter implements Filter
{
    public boolean check(Node node)
    {
        if (node.getData() == 200)
            return false;
        return true;
    }
}

This class indicates that we would like to delete all sub-trees rooted at a node containing the data value 200.

The pruning class will be a lot like the deep copying class. This is because the pruned tree is actually a deep copy of the original tree minus the pruned nodes!

Here is what the pruning method looks like:

public void pruneTree(Node root, Node copy, Filter filter)
{
        Node left = root.getLeftNode();
        if (left != null && filter.check(left))
        {
            copy.setLeftNode(new Node(left));
            pruneTree(left, copy.getLeftNode(), filter);
        }
       
        Node right = root.getRightNode();
        if (right != null && filter.check(right))
        {
            copy.setRightNode(new Node(right));
            pruneTree(right, copy.getRightNode(), filter);
        }
}

This method can be called as follows:

Node copy = new Node(root);
Filter filter = new MyFilter();
pruneTree(root, copy, filter);

At this point, the node object "copy" will contain a pruned version of the original tree rooted at "root".

Finally, here is a small class that can print a tree in pre-order traversal. I used this class in testing out the code in this post:

public void printTree(Node root)
{
        if (root != null)
        {
            System.out.println(root.getData()); 
            printTree(root.getLeftNode());
            printTree(root.getRightNode());
        }
}

I hope you will now be able to create simple trees quickly. It won't be hard to modify these classes and methods to build more complex tree classes to handle more than two children, copying a subtree instead of the whole tree, prune more selectively, etc.

Happy tree coding:)

Friday, February 10, 2012

Solving the Boggle Game - Recursion, Prefix Tree, and Dynamic Programming

I spent this past weekend designing the game of Boggle. Although it looks like a simple game at a high level, implementing it in a programming language was a great experience. I had to use recursion, sorting, searching, prefix trees (also knows as trie's), and dynamic programming as I was improving the run time of the program. So here is a summary of the work I did in trying to write a very optimal solution to the Boggle game:



*Image courtesy: http://stackoverflow.com/questions/746082/how-to-find-list-of-possible-words-from-a-letter-matrix-boggle-solver

The game of Boggle is played on a N x N board (usually made of cubes that has letters engraved on it). Given a dictionary, you will have to construct words on the board following these rules:

i) The letters in the word must be "adjacent" to each other
ii) Two letters on the board are "adjacent" if they are located on the board next to each other horizontally, vertically, or diagonally.
iii) A word must be 3 or more letters long to be valid

The more words you can construct, the more points you get. Longer words get more points.

In computerse: given an N x N board, and a dictionary containing hundreds of thousands of words, construct those words from the board that are found in the dictionary.

There are a number of approaches that one can take to solve this problem. I designed three different algorithms to solve it. Here are they:

First solution - Recursion + Binary Search:

In this approach, we recurse (using backtracking) through the board and generate all possible words. For each word that are three or more letters long we check to see if it's in the dictionary. If it is, we have a match!

Here is the algorithm steps:

1. Read in the dictionary in to a container in memory.
2. Sort the dictionary.
3. Using backtracking, search the board for words.
4. If a word is found and it contains 3 or more letters, do a binary search on the dictionary to see if the word is there.
5. If searching was successful in the previous step, print the letter out.
6. Continue step 3-5 as long as there are more words to construct on the board.

Complexity of this approach:

In this solution, we do a good job on the dictionary search. Using binary search, we are quickly finding out whether a word is in dictionary or not. But the real bottleneck is in searching the board for words. For an N x N board the search space is O((N*N)!). I am not exactly sure about this number, but you can find some discussions of it here: http://www.danvk.org/wp/2007-08-02/how-many-boggle-boards-are-there/.
(N*N)! is a HUGE number even for N = 5. So this approach is is impractical and out of question for any useful implementation.


Second Solution - Pruned Recursion + Prefix Tree (also known as a Trie):

From the previous approach, our major concern was the enormous search space on the board. Fortunately, using a a prefix tree or trie data structure we could significantly cut down on this search space. The reasoning behind this improvement is that, if a word "abc" does not occur as a prefix to any word in the dictionary there is no reason to keep searching for words after we encounter "abc" on the board. This actually cut down the run time a lot.

Here is the algorithm steps:

1. Read a word from the dictionary file.
2. Insert it into a prefix tree data structure in memory.
3. Repeat steps 1-2 until all words in the dictionary have been inserted into the prefix tree.
4. Using backtracking, search the board for words.
5. If a word is found and it contains 3 or more letters, search the prefix tree for the word.
6. If searching was *not* successful in the previous step, return from this branch of the backtracking stage. (There is no point to continue searching in this branch, nothing in the dictionary as the prefix tree says).
7. If searching was successful in step 5, continue searching by constructing more words along this branch of backtracking and stop when the leaf node has been reached in the prefix tree. (at that point there is nothing more to search).
8. Repeat steps 4-7 as long as there are more words to search in the backtracking.

Complexity of this approach:

This approach significantly improves on the first one. Building a prefix tree our of the dictionary words is O(W * L), where W is the number of words in the dictionary and L is the maximum length of a word in the dictionary.
Searching the board will be of the same order as the dictionary since we are not really searching words that are not in the dictionary. But in reality it will be more work than that as we still need to backtrack along the board to construct new words until we can consult the dictionary prefix tree to know whether it exists or not.

Third and Final Solution - No search space + Dynamic Programming:

The 2nd approach mentioned above was good enough until the board size was 5. Unfortunately with a board size of 6, that too was taking forever to complete!



It got me into thinking - "Dang, this search space is still too big to search! Can I just get rid of it entirely?" And then this idea popped into my mind: instead of random constructing word after word in this infinite ocean of words why don't I take a word from the dictionary and somehow magically check whether that's available on the board or not?

It turns out, we can use a nifty dynamic programming technique to quickly check whether a word (from the dictionary in this case) can be constructed from the board or not!

Here is core point of the dynamic programming idea:


For a word of length k to be found (end location) at the [i, j]-th location of the board, the k-1'th letter of that word must be located in one of the adjacent cells of [i, j].



The base case is k = 1.

A letter of length 1 will be found (end location) in the [i, j]-th cell of the board of the only letter in the word matches the letter in the [i, j]-th location of the board.

Once our dynamic programming table is populated with the base case, we can build on top of that for any word of length k, k > 1.

Here is a sample code for this:

for (k = 2; k < MAX_WORD_LENGTH; ++k)
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            if (board[i][j] == word[k])
            {
                 for all the "adjacent" cells of board[i, j]
                     if we table[k-1][adjacent_i][adjacent_j] is true
                         then table[k][i][j] = true;
             }


*I have the C++ code, if you are interested to take a look leave a comment with your email address.

Run Time Complexity:

The run time complexity of this approach is obvious is pretty obvious. It's O (W * N * N * MAX_WORD_LENGTH). Here N is dimension of the board which is usually between 4 to 6. So essentially the algorithm runs in the order of the size of the dictionary!

I solved a 6 X 6 boggle game with a dictionary with 600,000 words in 0.002 seconds! With I/O it as about 0.57 seconds. But with the trie and binary search approaches, it took much longer!

So, here is the summary of my a weekend's worth of adventure in to the world of algorithms and data structures. Feel free to ask me any questions, or any bug in these algorithms:)

Credit where it is due: My friend Satej who is a ninja problem solver and a PhD student at the UCF explained the dynamic programming solution to me first before I refined and finalized it. Thanks Satej!


Update (5/10/2012):

So many people have asked for the source code that I feel I better add the C++ code here:) The final, polished code for this is not in my home computer here, so if there is a bug please let me know!


#include <cstdio>
#include <iostream>

using namespace std;

const int N = 6; // max length of a word in the board

char in[N * N + 1]; // max length of a word
char board[N+1][N+2]; // keep room for a newline and null char at the end
char prev[N * N + 1];
bool dp[N * N + 1][N][N];

// direction X-Y delta pairs for adjacent cells
int dx[] = {0, 1, 1, 1, 0, -1, -1, -1};
int dy[] = {1, 1, 0, -1, -1, -1, 0, 1};
bool visited[N][N];

bool checkBoard(char* word, int curIndex, int r, int c, int wordLen)
{
    if (curIndex == wordLen - 1)
    {
        //cout << "Returned TRUE!!" << endl;
        return true;
    }
   
    int ret = false;
       
    for (int i = 0; i < 8; ++i)
    {
        int newR = r + dx[i];
        int newC = c + dy[i];
       
        if (newR >= 0 && newR < N && newC >= 0 && newC < N && !visited[newR][newC] && word[curIndex+1] == board[newR][newC])
        {
            ++curIndex;
            visited[newR][newC] = true;
           
            ret = checkBoard(word, curIndex, newR, newC, wordLen);
            if (ret)
                break;
               
            --curIndex;
            visited[newR][newC] = false;
        }
    }
   
    return ret;           
}

int main(int argc, char* argv[])
{
   
    int i, j, k, l;
   
    FILE* fDict = fopen("dict.txt","r");
    FILE* fBoard = fopen("tmp2.txt","r");
   
    for(i = 0; i < N; ++i)
        fgets(board[i], N+2, fBoard);
   
    strcpy(prev,"");
    int pLen = 0;
   
    while(fgets(in, N*N + 1, fDict))
    {
        int len = strlen(in);
        if (in[len-1] == '\n')
        {
            in[len-1] = '\0'; // remove the trailing newline
            --len;
        }
       
        if (len < 3)
            continue; // we only want words longer than 3 or more letter
           
        for(i = 0; i < len && i < pLen; ++i)
        {
            if(prev[i] != in[i])
                break;
        }
       
        int firstMismatch = i; // little optimization: building on previous word (will benefit if the word list is sorted)
       
        if(firstMismatch==0)
        {
            for(i = 0; i < 6; ++i)
            {
                for(j = 0; j < 6; ++j)
                {
                    if(board[i][j] == in[0])
                        dp[0][i][j] = true;
                    else
                        dp[0][i][j] = false;
                }
            }
            firstMismatch = 1;
        }
       
        for(k = firstMismatch; k < len; ++k)
        {
            for(i=0;i<6;++i)
            {
                for(j=0;j<6;++j)
                {   
                    dp[k][i][j] = false;
                           
                    if(board[i][j] != in[k])
                        continue;
                       
                    for(l= 0; l < 8 && !dp[k][i][j]; ++l)
                    {
                        int ti = i + dx[l];
                        int tj = j + dy[l];
                       
                        if(ti < 0 || ti >= 6 || tj < 0 || tj >= 6)
                            continue;
                       
                        if (dp[k-1][ti][tj])
                            dp[k][i][j] = true;
                    }
                }
            }  
        }
       
        // check if the word is tagged as found in the dp table
        bool flag = false;
        for(i = 0; i < 6 && !flag; ++i)
        {
            for(j = 0; j < 6 && !flag; ++j)
            {
                if(dp[len-1][i][j])
                    flag =true;
            }
        }
       
        // dp table says its there, but make sure its in the board and it does not repeat a location in the board
        if(flag)
        {
            //cout << "Checking word: " << in << endl;
            bool verified = false;
           
            for (i = 0; i < N && !verified; ++i)
            {
                for (j = 0; j < N && !verified; ++j)
                {
                    if (in[0] != board[i][j])
                        continue;
                       
                    memset(visited, false, sizeof(visited));
                    visited[i][j] = true;
                   
                    if (checkBoard(in, 0, i, j, len))
                    {
                        cout << in << endl;
                        break;
                    }
                }
            }
        }
       
        strcpy(prev,in);
        pLen=len;
           
    }
   
    return 0;
}


Enjoy!

Update 3/15/2015: I have added a Java version of the boggle game solution on my Github page here: https://github.com/bilash/boggle-solver

The Java version builds a Trie for the dictionary and then uses the dynamic programming approach mentioned above.

Friday, January 20, 2012

The Power of Perl Regular Expressions

I am not a big scripting language person. C++ and Java (and occasional C#) have been my stronghold so far. But recently I have been strolling around in the Perl world at work. And to my amazement I found Perl so rich a language that I fell in love with it.

Here is an example of the power and succinctness of Perl. This one involves using Perl's powerful regular expressions and file/directory handling features.

For a project I was doing I needed to modify contents of a large number of files in a directory. The actual modification needed was really simple. For all occurrences of a term, say XYZ, replace it with .XYZ (insert a dot before the term). The term XYZ could or could not be preceded by a "-" (minus sign). Other than that the term is guaranteed to be separated by white space before and after it. Also, the term can occur at the beginning or at the end of a line.

Now, I was new to Perl, and regular expressions for that matter. I struggled with the problem for one and half day. There were always one or two cases that I was not covering, or cases I was not covering that I was not supposed to. Finally I gave up and posted this as a question on Stackoverflow:

Overlapping text substitution with Perl regular expression.

Within 10 minutes I had a bunch of answers from some really Perl expert guys. I am so thankful to those people who answered it and saved me from more days of struggling with the problem!

Here is what I ended with for the whole problem:


$^I = ".copy";
my @ARGV = <*.txt>;

while (<>)
{
    s/((?:^|\s)-?)(XYZ)(?=\s|$)/$1.$2/g;
    print;
}


That's it! How cool is that?

If it was in C++ or Java, I would have to deal with opening file streams, buffer streams in Java, open a file one at a time, read a line one by one, store it on a list or something in memory, and then write the whole thing back to file system, so on and so forth. And I would easily end up writing close to a hundred lines of code for that!

But in Perl, it's just those few lines above. The diamond operator (<>) nicely takes care of the directory and file browsing part so succinctly. And the one line regular expression finds all references of the pattern in all the files (denoted by the wildcard *.txt) and replaces the matching patterns with a dot inserted in the beginning.

I admit, Perl's regex and file/directory handling packages are working in the background, but still the expressive power of these tools are so elegant!

I hope to keep exploring Perl more in the coming days. Specially the powerful regular expressions!