When in doubt, try sorting

Posted by Tunde Michael | Mon Nov 6, 2017 06:07 PM | Viewed: 677

1.0 Searching

1.0.0 Introduction

  Consider what it takes to search for "Michael" in an array of 1billion unique strings (or texts) stored in no particular order. If we're lucky, "Michael" will be the first element of the array (at index 0) or close to the first, say at array index less than 1000. But if we're not so lucky, the term might be the last element in the array or might not even be present at all. 

  In any of these cases, there is only one way to find out if the term exists and at what position in the array. And that is to search through the entire collection and compare every element against our search term. In simple terms, that is 1billion iteration for a text that might not even be present. And if we have to do this for say a billion requests or users per day the way a company like Google does every day? Please don't bother to do the Math of 1 billion x 1 billion. But wait, what if our collection grows to 100billion records?

   /**
     *
     * @param collection
     * @param searchValue
     * @return the index of the first occurrence of the search value if found or
     * -1 if collection is empty or the search value is not present in the
     * collection.
     */
    public static int linearSearch(int[] collection, int searchValue) {
        if (collection == null || collection.length == 0) {
            return -1;
        }
        for (int i = 0; i < collection.length; i++) {
            if (collection[i] == searchValue) {
                return i;
            }
        }
        return -1;
    }


Of course what we have just described is the Linear Search Algorithm (implemented in Java above) so let's talk about how to solve this problem with a more efficient algorithm made possible only by sorting the array first. 

1.0.1 Binary Search

  Binary search is only made possible by sorting algorithms. But before we go into that, let's talk about the huge improvement gain it has over the linear algorithm described above. 

  You remember we have to perform 1billion iterations in the worst case and compare every element against our search term earlier? With binary search algorithm, we only have to perform 30 iterations to know if the text exists in our array of a billion records and return its position in the worse case and if our collection grows to 100billion records, we will only perform 37 iterations as binary search has O(logn) complexity against linear search with O(n) complexity where n is the number of elements in the collection i.e log to base 2 of 1billion is approximately 30 and log to base 2 of 100billion is 37.

  Imagine 30 against 1 billion or 37 against 100billion iterations? And if we have to handle 1billion search requests in a day, that's only 30billion iterations compared to 1billion by 1billion and 37 compared 1billion by 100billion in the former algorithm. That's a huge performance gain but it's only possible if the data is sorted. 

   /**
     *
     * @param collection
     * @param searchValue
     * @return the index of the first occurrence of the search value if found or
     * a value less than 0 if collection is empty or the search value is not
     * present in the collection.
     */
    public static int binarySearch(int[] collection, int searchValue) {
        Arrays.sort(collection); // a local copy of the array may be neccessary
        int value = Arrays.binarySearch(collection, searchValue);
        return value;
    }

2.0 Anagram Test

  Dictionary.com defines anagram as a word, phrase or sentence formed by re-arranging letters of another. Examples are (tunde, ednut), (spar, rasp), (silent, listen), (angel, glean) etc. Although anagrams are mostly used for fun brain recreational exercises, some serious applications are found in Psychology to assess the implicit memory of young people. But don't be surprised if you get asked anagram test in a technical interview (mostly as warm-up questions anyways).

   /**
     *
     * @param text1
     * @param text2
     * @return true if anagram and false otherwise. Hint: dont forget Java is
     * case sensitive.
     */
    public static boolean isAnagram(String text1, String text2) {
        if (text1 == null || text2 == null) {
            return false;
        }
        char[] array1 = text1.toCharArray();
        char[] array2 = text2.toCharArray();
        Arrays.sort(array1);
        Arrays.sort(array2);
        // after sorting, the 2 arrays must be equal
        return Arrays.equals(array1, array2);
        // I suggest you try and implement binary search without using Java inbuilt 
        // methods 
        // of Arrays.sort() and Arrays.binarySearch();
    }


I suggest you challenge yourself to write a method or function that generates and prints every possible anagram (meaningful words or not) of every word, phrase or sentence passed to it as a parameter.

3.0 Subset Check

   Set A is a subset of set B if and only if A is wholly contained in B or A is exactly equal to B (in terms the same number of elements and their values at all indices). Although this sounds Mathematical, subset problems are more common in computer science and programming than it seems. Database queries with AND operators are some form of subset searches. Imagine a query with 5 conditions that must be met, e.g

SELECT * FROM CUSTOMER c WHERE c.registration_date = ? 
AND c.age >= ? AND c.city = ? AND c.country = ?
AND c.date_of_birth BETWEEN  ? AND ?


  These 5 conditions are a subset of a larger set of perhaps 30 more attributes of the customers in our database. And for this query to return a customer, that customer's subset of the selected criteria must be a subset of our set with values for these attributes. That probably sounds confusing, let's look at a more practical example. 

  Your application is a basic payment system for a company that processes the payment in batches. It takes in an Excel file containing IDs, names, and salaries of 10,000 employees at a time, extracts the data and check the supplied ID of each employee and the uploaded amount to be paid against a database of 100,000 employees of the company. If an invalid ID is found or there is a discrepancy in the salary uploaded and the amount in the database as salary for any employee, your app should reject the file and raise fraud alarm.

  A terrible solution will be to extract the 10,000 record per batch and loop through, passing each ID and amount into a query to the database to check if the ID exists and the salary amount is equal to what is present in the database. That's a horrible number of database requests. And of course, database calls should be avoided in a loop.

 As this is purely a subset problem, sorting presents one of the best ways of solving it. A good solution is to make one request to retrieve all the records in the database sorted (100,000 is not a lot of data that can't be held in memory) and perform a binary search of the elements of the smaller array (10,000 records from the Excel file) against the bigger one. If an ID is not found or an amount does not match, then fraud is detected.

   /**
     *
     * @param big
     * @param small
     * @return true if the small array is a subset and false otherwise
     */
    public static boolean isSubset(int[] big, int[] small) {
        if (big == null || big.length == 0 || small == null || small.length == 0) {
            return false;
        }
        Arrays.sort(big);
        for (int i = 0; i < small.length; i++) {
            if (Arrays.binarySearch(big, small[i]) < 0) {
                return false;
            }
        }
        return true;
    }


If at any time, the binary search subroutine returns a negative integer, the subset rule is violated and therefore the small array is not a subset of the big one. 

4.0 Disjoint Set

  With a solution quite similar to that of the subset example examined earlier, the disjoint set problem asks somewhat of an opposite question. To check if 2 sets are disjoint is to prove that they don't have any element in common. In practical terms, if we're friends on Facebook, do we have at least one mutual friend? (Perhaps we're writing a rudimentary algorithm to predict "People you may know" based on mutual friends on Facebook).

  The only difference to ensure an optimal solution is to perform the binary search subroutine exactly as in the subset solution but this time against the smaller set instead of the larger one i.e sort the smaller set and loop through the bigger one. The reason is that we need to have a greedy mindset which is to return as soon as we find a match (disjoint prove holds only when no match is found in the 2 sets) and declare the arrays non-disjoint.

4.0 Duplicate Search

  Without incurring memory overhead of an additional data structure (like Hashtable), the most efficient algorithms sort the collection and perform a linear search on the collection checking all adjacent pairs. 

How do we check if a collection has duplicate values, the example below uses 2 pointers while looping through the collection.

   /**
     *
     * @param a
     * @return true if the collection contains duplicates and false otherwise
     */
    public static boolean hasDuplicate(int[] a) {
        Arrays.sort(a);
        int x = 0; // trailing pointer
        // note the iteration starts from 1 not 0
        for (int i = 1; i < a.length; i++) {
            if (a[x] == a[i]) {
                return true;
            }
            x++;
        }
        return false;
    }

5.0 Closest Pair

Given an array of integers (without duplicates), find the pair with the smallest difference between them.

  Closest pair problem has a lot of variations, let's settle for this as the goal is to emphasize how sorting helps us solve this problem elegantly and efficiently rather than working on the toughest of problems out there. Please note that the same principle can be applied to many closest pair problems.

  This problem becomes a lot easier to solve once the array is sorted as the closest pair of elements must lie next to each other somewhere in the array and performing an O(n) linear search on the sorted array will allow us to locate the pair in approximately O(nlogn) worse case.

   /**
     *
     * @param a Prints the closest pair to the console.
     *
     */
    public static void printClosestPair(int[] a) {
        // check for null and empty array here
        Arrays.sort(a);
        int x = a[0];
        int y = a[1];
        int d = y - x;
        int i = 1;
        for (int j = 2; j < a.length; j++) {
            if (a[j] - a[i] < d) {
                x = a[i];
                y = a[j];
                d = y - x;
            }
            i++;
        }
        System.out.println("Closest pair:  " + x + "," + y);
    }

6.0 Frequency Distribution

  Frequency distribution answers the question: which of our products is the highest selling or which of our products do we need to re-stock the most frequently? I'm sure you can come up with hundred more variations of this question.

  The goal is to solve this problem efficiently without incurring memory overhead of additional data structure. And this can be achieved by sorting the collection.

   /**
     *
     * @param a
     * @return the element with the highest frequency
     */
    public static int getMostFrequentElement(int[] a) {
        // handle null, empty & other edge cases here
        Arrays.sort(a);
        // set default values
        int mostFreq = a[0];
        int highestFreq = 1;
        int length = a.length;
        int startPointer = 0;
        int tempFreq;
        int leadPointer;
        while (startPointer < length) {
            tempFreq = 1;
            leadPointer = startPointer + 1;
            while (a[startPointer] == a[leadPointer] && leadPointer < length) {
                leadPointer++;
                tempFreq++;
            }
            if (tempFreq > highestFreq) {
                mostFreq = a[startPointer];
                highestFreq = tempFreq;
            }
            startPointer = leadPointer;
        }
        return mostFreq;
    }

7.0 nth Largest or Smallest Element of a Collection

  Finding the minimum, maximum or nth largest or smallest element of a collection is one problem made very easy by sorting. Sorting becomes more golden especially when multiple requests are going to be made for such data on this collection. 

   /**
     *
     * @param a
     * @param position
     * @return nth largest element in the collection
     *
     */
    public static int findnThLargest(int[] a, int position) {
        // handles null, empty and other special cases here
        Arrays.sort(a);
        int nthLargest;
        if (position > a.length / 2) {
            nthLargest = a[position - 1];
        } else {
            nthLargest = a[a.length - position - 1];
        }
        return nthLargest;
    }


One other way of implementing this (more concise and cleaner) is by passing in a reverse order comparator. Let's find the nth smallest element with this approach.

   /**
     *
     * @param a
     * @param n
     * @return the nth smallest element in the array
     */
    public static int findnThSmallestElement(Integer[] a, int n) {
        // handles null, empty and other special cases here
        Arrays.sort(a, Comparator.reverseOrder());
        if (n > a.length) {
            throw new IllegalArgumentException("Invalid position.");
        }
        return a[n - 1];
    }

Conclusion 

Sorting is one of the most studied concepts in computer science and deservedly so as alot of problems that would have been impossible or at best extremely difficult are made a lot easy when the data is sorted.

So when in doubt, try sorting 

Au revoir!

 

[most of the code samples were not thoroughly tested and some not at all. Please use with care.]


Share this post
facebook twitter googleplus linkedin





Comments (1)


Michael Orokola

Awesome!
Sat May 26, 2018 02:13 PM

NEWSLETTER

Never miss a thing