How to efficiently find the n’th biggest element in a collection

Looking at std::nth_element the other day, I noticed that it’s complexity is O(n) for a collection of size n, and wondered how that was achieved. A basic implementation of an algorithm to find the ith-biggest element might start by sorting the collection and index to the ith afterwards – complexity O(n*log(n)):

int ith_element_by_sorting( std::vector<int> input, int i )
{
    std::sort( std::begin(input), std::end(input) );
    return input[i];
}

It’s a small step to realise then that you don’t need to sort all n elements of the collection – only the first i elements, so complexity O(n*log(i)):

int ith_element_by_partial_sort( std::vector<int> input, int i )
{
    std::partial_sort( std::begin( input ), std::begin( input ) + i + 1, std::end( input ) );
    return input[i];
}

But the real trick is that you don’t need to do any sorting at all. That’s the approach taken by quickselect, which is the selection sibling to quicksort, and achieves O(n) complexity on average:

int partition( std::vector<int>& values, int left, int right, int pivot )
{
     auto pivotValue = values[ pivot ];

     std::swap( values[pivot], values[right] ); // Move pivot to end
     auto store_pos = left;

     for ( int j = left; j < right; ++j )
     {
         if ( values[j] < pivotValue )
         {
             std::swap( values[ store_pos ], values[j] );
             ++store_pos;
         }
     }

     std::swap( values[right], values[ store_pos ] );  // Move pivot to its final place

     return store_pos;
}

int quickselect( std::vector<int> values, int left, int right, int i )
{
    // Unordered - no need to sort values at all, 
    // instead we recursively partition only the subset of values
    // containing the i'th element, until we have either
    // a) trivial subset of 1
    // b) pivot is moved to exactly the location we wanted 
    while( 1 )
    {
         if ( left == right )
         {
             return values[left];
         }

         // Pick a pivot from middle of values.
         // Better options are a random pivot or median of 3
         auto pivot = (left + right)/2; 

         // Move anything smaller than values[pivot] to the left of pivot,
         // and return updated position of pivot
         pivot = partition( values, left, right, pivot );

         if ( pivot == i )
         {
             return values[i];
         }
         else if ( i < pivot )
         {
             right = pivot - 1;
         }
         else
         {
             left = pivot + 1;
         }
    }
}

int ith_element_by_quickselect( const std::vector<int>& input, int i )
{
    return quickselect( input, 0, input.size()-1, i );
}

int ith_element( const std::vector<int>& input, int i )
{
    if ( i < 0 || i >= input.size() )
    {
        std::ostringstream ss;
        ss << "Input '" << i << "' outside range [0," << input.size() << ")";
        throw std::out_of_range( ss.str() );
    }

    return ith_element_by_quickselect( input, i );
}

Here’s some test code to check that the implementation works:

template <typename F, typename T>
void should_be( T t, F f, const std::string& message )
{
    try
    {
        std::ostringstream ss;

        auto got = f();
        if ( got != t )
        {
            ss << message << " got " << got << ", expected " << t;
            throw std::runtime_error( ss.str() );
        }
        else
        {
            ss << "OK: " << message << ": got " << got;
            std::cout << ss.str() << "\n";
        }
    }
    catch( const std::exception& ex )
    {
        // Report error if either f() threw or we found unexpected value
        std::cout << "ERROR: " << ex.what() << "\n";
    }
}

template <typename F>
void should_throw( F f, const std::string& message )
{
    try
    {
        f();
    }
    catch( const std::exception& ex )
    {
        std::cout << "OK: " << message << ": threw \"" << ex.what() << "\"\n";
        return;
    }

    std::cout << "ERROR: " << message << " should have thrown\n";
}

#define SHOULD_BE( t, expr ) should_be( t, [](){ return expr; }, #expr )
#define SHOULD_THROW( expr ) should_throw( [](){ expr; }, #expr )

void testIthElement()
{
    SHOULD_THROW( ith_element( {}, 0 ) );
    SHOULD_THROW( ith_element( {1,2}, -1 ) );
    SHOULD_THROW( ith_element( {1,2,3}, 3 ) );

    SHOULD_BE( 1, ith_element( {1}, 0 ) );
    SHOULD_BE( 0, ith_element( {0,1,2,3,4,5,6,7,8}, 0 ) );
    SHOULD_BE( 2, ith_element( {0,1,2,3,4,5,6,7,8}, 2 ) );
    SHOULD_BE( 6, ith_element( {5,4,7,6,1,2,0,8,3}, 6 ) );
    SHOULD_BE( 8, ith_element( {5,4,7,6,1,2,0,8,3}, 8 ) );
    SHOULD_BE( 5, ith_element( {5,5,5,5,5,5}, 1 ) );
}

Here’s the output:
QuickSelect

In fact, reviewing old posts on this blog, I found this link that dates back to 2013, when std::nth_element was only required by the standard to have worst-case O(N^2) complexity, with O(N) on average – precisely what you’d get from quickselect. Now though, thanks to introselect, O(N) is possible.

Leave a comment

Filed under C++, C++ Code, Programming

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s