Up until recently I’ve been very good about using std::string for strings in all of my projects.  In Kuiper 2 I decided that I wanted to try something different.  String interning is a method of allocating strings such that there is only one instance of a particular string.  For example:

// If all strings were interned then writing:
const char *a = "Hello, world!";
const char *b = "Hello, world!";

// Then this would be true:
if( a == b )
    std::cout << "The strings are the same!" << std::endl;

The code segment above is meant as an illustration and won’t necessarily work as expected. It is possible to have the compiler combine all copies of identical string literals via “string pooling” using the /GF option in MSVC(I think GCC pools strings by default.) With pooling, the code above will work as expected but it only takes care of string literals. What about dynamically allocated strings?

String Pool

To handle dynamic strings we will need to create our own string pool that will keep a unique copy of every string. Let’s keep the interface very simple:

// A wrapper around a const char *.  This helps distinguish interned strings
// from any other kind of string.
class InternedString
{
    const char *string_;

public:
    InternedString() : string_( nullptr ) { }
    explicit InternedString( const char *string ) : string_( string ) { }

    const char *Data() const { return string_; }
    operator const char *() const { return string_; }

    bool operator ==( const InternedString &rhs ) const { return string_ == rhs.string_; }
    bool operator !=( const InternedString &rhs ) const { return string_ != rhs.string_; }
};

// The string pool that will hold instances of interned strings.
class Strings
{
    class Impl;
    Impl *impl_;

public:
    Strings();
    ~Strings();

    InternedString Internalize( const char *str );
};

Using this abstraction we will be able to write things like this:

Strings strings;
InternedString a = strings.Internalize( "Hello, world!" );
InternedString b = strings.Internalize( "Hello, world!" );

// And this would be true.
if( a == b )
    std::cout << "The strings are the same!" << std::endl;

This is all neat and nice, but why would we want to do this?

Why?

Depending on your use-case there may be some benefits for interning strings. Here are some reasons why I decided to use interning in Kuiper 2:

  • Only one copy of each unique string – Kuiper 2 uses a data driven approach for all of the in game assets. All scenes, textures, shaders, sound effects, pieces of text and everything else has an internal name that can be referenced by the engine. Interning the names means only one copy of each name is around.
  • Fast string comparisons – This is related to the previous point. Since there is only one copy of each name, finding named objects and comparing names can be done very quickly! It’s as fast as comparing pointers and games are supposed to be fast.
  • Warm and fuzzy feelings – It feels great knowing that all your strings are in one place.

Now for the dirty bit!

Implementation

Essentially the string pool is nothing more than a set of strings where equality is determined by string equality. Internalizing a string involves a check to see if a copy of it is already in the set. If it’s in there we can just return a pointer to that string. If the string is not found then we have to make a copy, insert it into a set and return a pointer to the new copy. So let’s just write exactly that:

#include <cstring>
#include <unordered_set>;

// Hash for C strings.  I think I got this from boost.
class SimpleStringHash
{
public:
    size_t operator() ( const char *str ) const
    {
        size_t hash = 0;

        int c;
        while( c = *str++ )
            hash = c + ( hash << 6 ) + ( hash << 16 ) - hash;

        return hash;
    }
};

// Predicate for comparing strings by value.
class StringPredicate
{
public:
    bool operator() ( const char *a, const char *b ) const
    {
        return a == b || strcmp( a, b ) == 0;
    }
};

class Strings::Impl
{
    // A customized unordered_set that compares and hashes elements as if
    // they were strings and not pointers.
    typedef std::unordered_set< const char *, SimpleStringHash, StringPredicate > StringSet;

    StringSet internalizedStrings_;

public:
    ~Impl()
    {
        for( const char *str : internalizedStrings_ )
            delete[] str;
    }

    InternedString Internalize( const char *str )
    {
        if( str == nullptr )
            return InternedString( nullptr );

        StringSet::iterator i = internalizedStrings_.find( str );
        if( i == internalizedStrings_.end() )
        {
            // String not found so place it into the set.
            size_t length = strlen( str );
            char *internalized = new char[ length + 1 ];

            memcpy( internalized, str, length );
            internalized[ length ] = '\0';

            internalizedStrings_.insert( internalized );
            return InternedString( internalized );
        }

        return InternedString( *i );
    }
};

// Forward all calls to the PIMPL.
Strings::Strings() : impl_( new Impl() ) { }
Strings::~Strings() { delete impl_; }

InternedString Strings::Internalize( const char *str ) { return impl_->Internalize( str ); }

That’s all there is to internalizing strings. In a future post I will talk about how to use a memory pool to keep all strings allocated in the same memory area.

Share