Say that one day you decide that you’re tired of sending the (0, 0, 0, 1) row vector to your GPU over and over again. The solution, of course, is to drop that row altogether and use 3×4 matrices. Consider a uniform buffer object that looks like this:

layout( std140 ) uniform CommonUniforms
{
    mat4 projectionMatrix; // 4 uniform slots.
    mat4 cameraMatrix;     // 4 uniform slots.
    vec4 cameraPosition;   // 1 uniform slot.
};

This uniform buffer object will take up 9 uniform slots, where one slot is the 16 bytes of a vec4. In GLSL, matnxm is a matrix with n columns and m rows, and we want our matrix to pre-multiply a vec4, so it needs to have 4 columns. So, let’s select mat4x3 as the replacement. The resulting buffer object becomes:

layout( std140 ) uniform CommonUniforms
{
    mat4 projectionMatrix; // 4 uniform slots.
    mat4x3 cameraMatrix;   // ? uniform slots.
    vec4 cameraPosition;   // 1 uniform slot.
};

Everything seems good and the shaders should still compile (perhaps with a few extra casts to vec4 in some places). You also go ahead and change your internal buffer representation in the code (I’m using structs of glm primitives). But alas, when the application runs, the screen is completely blank! What happened? According to the specification, std140 pads each vec3 column of a column-major mat4x3 to a full vec4, so the matrix takes up 4 uniform buffer slots just like before and the whole buffer will still require 9 slots. That’s no good, as nothing is saved.
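My best guess at why the screen goes blank is a plain size mismatch: the block still expects 64 bytes for the matrix, while a tightly packed CPU-side mat4x3 (which, as far as I know, is what glm::mat4x3 gives you by default) only supplies 48, so everything after it lands in the wrong place. Here is a minimal sketch of that arithmetic in C++. The helper names `std140MatrixSlots` and `packedMatrixBytes` are made up for illustration, and the sketch only covers float matrices.

```cpp
#include <cassert>
#include <cstddef>

// One "slot" in this post is 16 bytes, the size of a vec4.
constexpr std::size_t kSlotBytes = 16;

// Hypothetical helper: slots used by a float matNxM (N columns, M rows).
// std140 stores a column-major matrix as an array of its columns and a
// row-major matrix as an array of its rows, and every element of such an
// array is padded to a full vec4.
constexpr std::size_t std140MatrixSlots(std::size_t columns, std::size_t rows, bool rowMajor)
{
    return rowMajor ? rows : columns;
}

// Hypothetical helper: bytes used by the same matrix when its floats are
// stored back to back with no padding (assumes a 4-byte float).
constexpr std::size_t packedMatrixBytes(std::size_t columns, std::size_t rows)
{
    return columns * rows * sizeof(float);
}

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// mat4x3, default column_major: four vec3 columns, each padded to a vec4.
static_assert(std140MatrixSlots(4, 3, false) * kSlotBytes == 64, "4 slots");

// The same 12 floats, tightly packed, occupy only 48 bytes.
static_assert(packedMatrixBytes(4, 3) == 48, "12 floats");
```

The two numbers disagree by exactly one slot, which is enough to shift cameraPosition and scramble the matrix columns.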

Now it is obvious that the matrix type that we want is actually mat3x4, which has 3 columns of 4 components each. The problem with it is that even though it only needs 3 uniform slots, it cannot pre-multiply a vec4. All operations M*v will need to be written as v*M to get the same result. If that is not a problem, then we’re done, at the cost of multiplication consistency.
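To convince yourself that v*M with the mat3x4 really matches M*v with the mat4x3, here is a small CPU-side sketch in C++. The types and function names are stand-ins I made up, not glm or GLSL API; they only mimic GLSL's column-array storage and its two multiplication rules.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec3 = std::array<float, 3>;
using Vec4 = std::array<float, 4>;

// GLSL-style storage: a matrix is an array of columns.
using Mat4x3 = std::array<Vec3, 4>; // 4 columns, 3 rows
using Mat3x4 = std::array<Vec4, 3>; // 3 columns, 4 rows

// M * v: the columns of M scaled by the components of v and summed.
inline Vec3 mulMatVec(const Mat4x3& m, const Vec4& v)
{
    Vec3 result{};
    for (std::size_t col = 0; col < 4; ++col)
        for (std::size_t row = 0; row < 3; ++row)
            result[row] += m[col][row] * v[col];
    return result;
}

// v * M: v is treated as a row vector, so component i of the result
// is the dot product of v with column i of M.
inline Vec3 mulVecMat(const Vec4& v, const Mat3x4& m)
{
    Vec3 result{};
    for (std::size_t col = 0; col < 3; ++col)
        for (std::size_t row = 0; row < 4; ++row)
            result[col] += v[row] * m[col][row];
    return result;
}

// The mat3x4 that holds the same transform is the transpose of the mat4x3.
inline Mat3x4 transposed(const Mat4x3& m)
{
    Mat3x4 t{};
    for (std::size_t col = 0; col < 4; ++col)
        for (std::size_t row = 0; row < 3; ++row)
            t[row][col] = m[col][row];
    return t;
}
```

For any m and v, mulMatVec(m, v) and mulVecMat(v, transposed(m)) compute the same three dot products, just indexed the other way around.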

Of course we’re not done; M*v looks way too good to give up so easily. The solution comes in the form of another layout qualifier:

layout( std140 ) uniform CommonUniforms
{
    mat4 projectionMatrix;                     // 4 uniform slots.
    layout( row_major ) mat4x3 cameraMatrix;   // 3 uniform slots.
    vec4 cameraPosition;                       // 1 uniform slot.
};

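The final iteration behaves as expected. With row_major, std140 stores the matrix as its 3 rows of 4 components, each a full vec4, so the block uses 8 uniform slots. The qualifier only changes how the data is laid out in the buffer, not how the matrix behaves in the shader, so you can still write M*v with the vector on the right.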
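On the application side, the buffer you upload has to follow the same row-by-row layout. Below is a minimal sketch of a matching C++ struct, using plain float arrays instead of glm types so the byte layout is explicit. The struct name `CommonUniformsCpu` and the helper `packCameraMatrix` are hypothetical, and the source matrix is assumed to be 16 floats in column-major order.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side mirror of the final CommonUniforms block.
struct CommonUniformsCpu
{
    float projectionMatrix[16]; // mat4, column-major: 4 slots.
    float cameraMatrix[3][4];   // row_major mat4x3: 3 rows of 4 floats, 3 slots.
    float cameraPosition[4];    // vec4: 1 slot.
};

// 8 slots of 16 bytes each, with members at the offsets std140 assigns.
static_assert(sizeof(CommonUniformsCpu) == 8 * 16, "8 slots");
static_assert(offsetof(CommonUniformsCpu, cameraMatrix) == 64, "after the mat4");
static_assert(offsetof(CommonUniformsCpu, cameraPosition) == 112, "after 7 slots");

// Hypothetical helper: copy the top three rows of a column-major 4x4 matrix
// (element (row, col) lives at index col * 4 + row) into row-major storage.
// The dropped fourth row is the constant (0, 0, 0, 1).
inline void packCameraMatrix(const float columnMajor4x4[16], float rows3x4[3][4])
{
    for (int row = 0; row < 3; ++row)
        for (int col = 0; col < 4; ++col)
            rows3x4[row][col] = columnMajor4x4[col * 4 + row];
}
```

The static_asserts catch a layout drift at compile time, which is a lot cheaper than staring at another blank screen.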